Comparing Machine Learning Algorithms in Python


In machine learning there are many kinds of algorithms: regression algorithms, classification algorithms, and so on. In this post I will compare, in a simple way, a few classification algorithms on the Pima Indians Diabetes dataset and use them to make predictions.

First, import the libraries and the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

#url = "https://raw.githubusercontent.com/MateLabs/Public-Datasets/master/Datasets/diabetes.csv"
diabetes_df = pd.read_csv("diabetes.csv")
names = ["times_pregnant", "glucose_concentration", "blood_pressure", "skin_fold", "serum_insulin", "body_mass",
         "diabetes_pedigree", "age", "class"]
diabetes_df.columns = names
diabetes_df.head()

The first five rows of the dataset:

[Figure: the first five rows of diabetes_df]

Our last column, "class", will be the "y" vector. We process this column to turn the "positive" values into 1 and the "negative" values into 0:

# Use .loc to avoid chained-indexing assignment, which pandas warns about
diabetes_df.loc[diabetes_df['class'] == 'positive', 'class'] = 1
diabetes_df.loc[diabetes_df['class'] == 'negative', 'class'] = 0
diabetes_df.head()

It now looks like this:

[Figure: the first five rows after converting the class column]
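If you prefer to avoid indexing-based assignment altogether, the same conversion can also be written with pandas' Series.map; a minimal equivalent sketch, assuming the column only ever contains the strings "positive" and "negative":

# Equivalent conversion using Series.map instead of .loc assignment
diabetes_df['class'] = diabetes_df['class'].map({'positive': 1, 'negative': 0})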

Let's look at some descriptive statistics:

print("Size: {}".format(diabetes_df.shape))

print(diabetes_df.describe())

diabetes_df.hist()

plt.show()

[Figures: output of describe() and per-column histograms]

Leaving out the last column, here is the normal distribution corresponding to each column of the dataset (using each column's mean and standard deviation):

breast_norm = diabetes_df.iloc[:, 0:8]
for column in breast_norm.columns:
    mu, sigma = breast_norm[column].mean(), breast_norm[column].std()
    s = np.random.normal(mu, sigma, breast_norm[column].count())
    count, bins, ignored = plt.hist(s, 30, density=True)
    plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
             np.exp(- (bins - mu)**2 / (2 * sigma**2)),
             linewidth=2, color='r')
    plt.xlabel(breast_norm[column].name)
    plt.ylabel("Probability")
    plt.show()

[Figure: fitted normal distribution for one of the columns]
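The Gaussian curve above is written out by hand from the normal density formula. If scipy is available, the same curve can be drawn with scipy.stats.norm.pdf, which is a little easier to read; a minimal sketch for a single column (the column name is just an example):

from scipy.stats import norm

col = breast_norm["glucose_concentration"]
mu, sigma = col.mean(), col.std()
x = np.linspace(col.min(), col.max(), 200)
plt.hist(col, bins=30, density=True)                              # histogram of the actual data, not resampled values
plt.plot(x, norm.pdf(x, loc=mu, scale=sigma), 'r', linewidth=2)   # fitted normal density
plt.xlabel(col.name)
plt.ylabel("Probability")
plt.show()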

Comparing the machine learning algorithms

First we split the dataset into "X" and "y" and into training and test sets, and then we standardize the features of X by removing the mean and scaling to unit variance:

X = diabetes_df.iloc[:, 0:8]
y = diabetes_df.iloc[:, 8]
y = y.astype('int')

# Split into train and test set using sklearn model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)

# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
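As a quick sanity check (not part of the original walkthrough), standardization computes z = (x - mean) / std, so each training feature should end up with mean ≈ 0 and standard deviation ≈ 1, while the test set is transformed with the training statistics and only comes close to that:

# Training features: means ~0, standard deviations ~1
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))
# Test features reuse the training mean/std, so they are only approximately standardized
print(X_test.mean(axis=0).round(3))
print(X_test.std(axis=0).round(3))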

The next step is to import the libraries and build our list of models:

seed = 7
scoring = 'accuracy'

# Importing ML and evaluation libraries
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Building the list of models
models = []
results = []
names = []   # collects the model names used in the evaluation loop below
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

Let's evaluate the performance of each estimator:

for name, model in models:
    # shuffle=True is required when passing random_state in recent scikit-learn versions
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

[Figure: mean and standard deviation of the cross-validation accuracy for each model]
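The scores collected in results also lend themselves to a visual comparison; here is a minimal sketch (not in the original post) that draws a box plot of the cross-validation accuracy per model:

# Box plot of the 10-fold cross-validation accuracy for each algorithm
fig = plt.figure()
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
ax.boxplot(results)
ax.set_xticklabels(names)
plt.show()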

As we can see, logistic regression gives the best performance, so it is our choice: we fit it on the training set and start making predictions on the test set.

# Making predictions
cls = LogisticRegression(random_state=7)
cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)

As a final step, we compare the predictions with the actual values:

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[Figure: accuracy score, confusion matrix, and classification report]
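For reference, the numbers in the classification report can be derived directly from the confusion matrix; a small illustrative sketch, using the [[TN, FP], [FN, TP]] layout that scikit-learn's confusion_matrix returns for binary labels 0/1:

cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()                     # [[TN, FP], [FN, TP]] flattened
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)                      # of the predicted positives, how many are correct
recall = tp / (tp + fn)                         # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)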

Final thoughts

We know that, most of the time, the performance of an ML algorithm depends on how well we are able to explore, interpret, and prepare the data. But comparing several algorithms on the same data gives us a clear picture of which one works best and how to use it.
