Comparing Machine Learning Algorithms in Python
In the field of machine learning there are many algorithms: regression algorithms, classification algorithms, and so on. In this article I will compare several classification algorithms in a simple way, using the Pima Indians diabetes dataset to make predictions.
First, import the libraries and the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#url = "https://raw.githubusercontent.com/MateLabs/Public-Datasets/master/Datasets/diabetes.csv"
diabetes_df = pd.read_csv("diabetes.csv")
names = ["times_pregnant", "glucose_concentration", "blood_pressure", "skin_fold", "serum_insulin", "body_mass",
"diabetes_pedigree", "age", "class"]
diabetes_df.columns = names
diabetes_df.head()
The first five rows of the dataset:
Our last column, "class", will be the "y" vector. We process this column to convert the "positive" values to 1 and the "negative" values to 0 (using .loc avoids pandas' chained-assignment warning):
diabetes_df.loc[diabetes_df['class'] == 'positive', 'class'] = 1
diabetes_df.loc[diabetes_df['class'] == 'negative', 'class'] = 0
diabetes_df.head()
Which looks like this:
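As a side note, the same encoding can be done in a single pass with `Series.map`. A minimal sketch on a toy column (the toy values mirror the dataset's "positive"/"negative" labels):

```python
import pandas as pd

# Toy column standing in for the real 'class' column
toy = pd.DataFrame({"class": ["positive", "negative", "positive"]})

# map() replaces every label in one vectorized pass
toy["class"] = toy["class"].map({"positive": 1, "negative": 0})

print(toy["class"].tolist())
```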
Let's look at some descriptive statistics:
print("Size: {}".format(diabetes_df.shape))
print(diabetes_df.describe())
diabetes_df.hist()
plt.show()
Excluding the last column, let's plot the fitted normal distribution for each column of the dataset:
diabetes_norm = diabetes_df.iloc[:, 0:8]
for column in diabetes_norm.columns:
    mu, sigma = diabetes_norm[column].mean(), diabetes_norm[column].std()
    s = np.random.normal(mu, sigma, diabetes_norm[column].count())
    count, bins, ignored = plt.hist(s, 30, density=True)
    plt.plot(bins, 1 / (sigma * np.sqrt(2 * np.pi)) *
             np.exp(-(bins - mu)**2 / (2 * sigma**2)),
             linewidth=2, color='r')
    plt.xlabel(diabetes_norm[column].name)
    plt.ylabel("Probability")
    plt.show()
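The red curve plotted in the loop above is the normal probability density, which peaks at x = mu with height 1/(sigma * sqrt(2 * pi)) and falls off quickly in the tails. A quick sanity check of the same formula, with made-up mu and sigma (not taken from the dataset):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Same formula used for the red curve in the plotting loop
    return 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

mu, sigma = 120.0, 30.0  # made-up values, roughly on the scale of glucose_concentration
peak = normal_pdf(mu, mu, sigma)              # density is highest at the mean
tail = normal_pdf(mu + 3 * sigma, mu, sigma)  # three standard deviations out

print(peak, tail)
```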
Comparing the Machine Learning Algorithms
First we split the dataset into "X" and "y" and then into training and test sets; after that we standardize the features of X by removing the mean and scaling to unit variance.
X = diabetes_df.iloc[:,0:8]
y = diabetes_df.iloc[:,8]
y = y.astype('int')
#Split into train and test set using sklearn model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
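Note that `fit_transform` is called on the training set only, while the test set gets a plain `transform`: the test data is scaled with the training set's mean and standard deviation, so no information leaks from the test set. A minimal sketch of that behaviour with toy numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # toy training column, mean = 2
test = np.array([[2.0]])                 # toy test value equal to the training mean

sc = StandardScaler()
train_scaled = sc.fit_transform(train)   # learns mean and std from train only
test_scaled = sc.transform(test)         # reuses the training statistics

print(train_scaled.mean(), test_scaled)
```

Because the test value equals the training mean, it is mapped to exactly 0.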
The next step is to import the libraries and build a list of our models:
seed = 7
scoring = 'accuracy'
#importing ml and evaluation libraries
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
#Building the list of models
models = []
results = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
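Each entry of `models` pairs a short label with an unfitted estimator, so extending the comparison only takes one more `append`. A standalone sketch adding a random forest (a model not used in this article, shown purely as an illustration, using a separate `demo_models` list):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Each entry is a (label, unfitted estimator) tuple
demo_models = [('SVM', SVC())]
demo_models.append(('RF', RandomForestClassifier(n_estimators=100, random_state=7)))

print([name for name, _ in demo_models])
```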
Let's evaluate the performance of the estimators:
model_names = []
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    model_names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
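Since `results` ends up holding one array of fold scores per model, ranking the models by mean accuracy is straightforward. A sketch with made-up scores (the real `cv_results` arrays would take their place):

```python
import numpy as np

# Made-up fold scores standing in for the real cross-validation results
fake_results = {
    'LR': np.array([0.78, 0.80, 0.76]),
    'KNN': np.array([0.70, 0.72, 0.71]),
}

# Sort models from best to worst mean accuracy
ranking = sorted(fake_results.items(), key=lambda kv: kv[1].mean(), reverse=True)
for name, scores in ranking:
    print("%s: %f (%f)" % (name, scores.mean(), scores.std()))
```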
As we can see, logistic regression has the best performance, so our choice is to fit it on the training set and start making predictions on the test set.
#Making predictions
cls = LogisticRegression(random_state=7)
cls.fit(X_train,y_train)
y_pred = cls.predict(X_test)
As a final step, we compare the predictions with the actual values:
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
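The three printouts are closely related: accuracy, precision, and recall can all be read directly off the confusion matrix cells. A sketch with a made-up 2x2 matrix (not the actual output of the model above):

```python
import numpy as np

# Made-up confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[90, 10],
               [15, 39]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # correct predictions over all predictions
precision = tp / (tp + fp)       # of predicted positives, how many were right
recall = tp / (tp + fn)          # of actual positives, how many were found

print(accuracy, precision, recall)
```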
Final Thoughts
As we know, most of the time the performance of an ML algorithm depends on our ability to explore, interpret, and manage the data. But a comparison like this one gives us an idea of which algorithm is best suited to the problem at hand.