Tuning Machine Learning Models with Grid Search

一条鱼

2018-12-30

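A model hyperparameter is a characteristic external to the model whose value cannot be estimated from data. Its value must be set before the learning process begins. Examples include C in support vector machines, k in k-nearest neighbors, and the number of hidden layers in a neural network.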

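In contrast, a parameter is an internal characteristic of the model whose value can be estimated from data. Examples include the beta coefficients of linear/logistic regression and the support vectors of a support vector machine.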
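To make the distinction concrete, here is a minimal sketch using scikit-learn's LogisticRegression. The synthetic data from make_classification and the names X_demo, y_demo and has_coef_before are assumptions for illustration only; they are not part of the walkthrough.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (not the Breast Cancer dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# C is a hyperparameter: we choose it before training starts
model = LogisticRegression(C=1.0)
print(model.get_params()['C'])

# coef_ and intercept_ are parameters: they only exist after fitting to data
has_coef_before = hasattr(model, 'coef_')
model.fit(X_demo, y_demo)
print(has_coef_before, hasattr(model, 'coef_'))
print(model.coef_.shape)  # one coefficient per feature
```

The value of C stays whatever we set, while the coefficients are estimated by fit().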

Grid search is used to find the hyperparameter values for a model that give the most "accurate" predictions.

Let's look at grid search by building a classification model on the Breast Cancer dataset (https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29).

1. Import the dataset and view the first 10 rows. The Python implementation is as follows:

#import libraries
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
#raw string, so that the backslashes in the Windows path are not treated as escape sequences
os.chdir(r'C:\Users\rohan\Documents\Analytics\Data')
#import data
data = pd.read_csv('breast-cancer-wisconsin.csv',header=None)
#set column names
data.columns = ['Sample Code Number','Clump Thickness','Uniformity of Cell Size',
 'Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size',
 'Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
#view top rows
data.head(10)


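Each row in the dataset belongs to one of two possible classes: benign (represented by 2) or malignant (represented by 4). Besides the class, the dataset has 10 attributes (listed above); all of them except the sample code number (an ID) are used for prediction.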

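2. Clean the data and recode the class values as 0/1 for model building (1 denotes a malignant case). Let's also look at the class distribution. The Python implementation is as follows: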

#Data cleaning
data = data.drop(['Sample Code Number'],axis=1) #Drop 1st column
data = data[data['Bare Nuclei'] != '?'].copy() #Remove rows with missing data
data['Bare Nuclei'] = data['Bare Nuclei'].astype(int) #Column was read as strings because of the '?' entries
data['Class'] = np.where(data['Class'] ==2,0,1) #Change the Class representation
#Distribution of Class
data['Class'].value_counts()


There are 444 benign cases and 239 malignant cases.
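As an alternative sketch of the cleaning step, pandas can be asked to treat '?' as a missing value while reading the file, so the column is numeric from the start. The four-row sample below is made up for illustration and uses a reduced column layout (id, one feature, class); it is not taken from the real dataset.

```python
import io
import pandas as pd

# Tiny made-up sample in a reduced layout (id, Bare Nuclei, Class); not real data
raw = io.StringIO("1001,5,2\n1002,?,4\n1003,3,2\n1004,10,4\n")

# na_values='?' makes pandas parse '?' as NaN instead of keeping it as a string
demo = pd.read_csv(raw, header=None, names=['id', 'Bare Nuclei', 'Class'],
                   na_values='?')

demo = demo.dropna().drop(columns=['id'])               # drop incomplete rows and the id
demo['Bare Nuclei'] = demo['Bare Nuclei'].astype(int)   # NaN had forced the column to float
demo['Class'] = (demo['Class'] == 4).astype(int)        # 1 = malignant, 0 = benign

print(demo.dtypes)
print(demo['Class'].value_counts())
```

Either approach should leave the same rows; this variant just avoids comparing against the string '?' after loading.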

3. Before building a classification model, let's build a dummy classifier to establish a "baseline" performance.

#Split data into attributes and class
X = data.drop(['Class'],axis=1)
y = data['Class']
#perform training and test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
#Dummy Classifier
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy= 'most_frequent').fit(X_train,y_train)
y_pred = clf.predict(X_test)
#Distribution of y test
print('y actual : \n' + str(y_test.value_counts()))
#Distribution of y predicted
print('y predicted : \n' + str(pd.Series(y_pred).value_counts()))


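From the output, we can see that the test set contains 68 malignant and 103 benign cases. However, our classifier predicts every case as benign, because that is the majority class.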

4. Compute the evaluation metrics for this model. The Python code is as follows:

# Model Evaluation metrics 
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))
#Dummy Classifier Confusion matrix
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))


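The model's accuracy is 60.2%, but in this situation accuracy may not be the best metric for evaluating the model. So let's look at the other evaluation metrics.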

[Figure: confusion matrix of the dummy classifier]

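The figure above is the confusion matrix, with labels and colors added for better intuition. Summarizing the confusion matrix: true positives (TP) = 0, true negatives (TN) = 103, false positives (FP) = 0, false negatives (FN) = 68. The formulas for the evaluation metrics are as follows: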

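Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * Precision * Recall / (Precision + Recall)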

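Because the model did not classify any malignant case correctly, recall is 0. Precision is 0/0 here, since the model made no positive predictions at all, and scikit-learn reports it as 0 in that case.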
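As a quick check of the arithmetic, the metrics can be computed directly from the four confusion-matrix counts. This is a minimal sketch; the helper name classification_metrics is hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here to mirror what scikit-learn reports.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Undefined ratios (zero denominator) are returned as 0.0 by convention.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return accuracy, precision, recall, f1

# Counts for the dummy classifier: TP = 0, TN = 103, FP = 0, FN = 68
acc, prec, rec, f1 = classification_metrics(tp=0, tn=103, fp=0, fn=68)
print(round(acc, 3), prec, rec, f1)  # 0.602 0.0 0.0 0.0
```

The 60.2% accuracy is simply the share of the majority class in the test set (103 of 171), which is why it says little about how well malignant cases are detected.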

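5. Now that we have a baseline accuracy, let's build a logistic regression model with default parameters and evaluate it. The Python implementation is as follows: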

#Logistic regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X_train,y_train)
y_pred = clf.predict(X_test)
# Model Evaluation metrics 
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))
#Logistic Regression Classifier Confusion matrix
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))


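By fitting a logistic regression model with default parameters, we get a much "better" model. The accuracy is 94.7%, and precision reaches an impressive 98.3%. Now let's look again at the confusion matrix for this model's results: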

[Figure: confusion matrix of the default logistic regression model]

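Looking at the misclassified cases, we can see that 8 malignant cases were wrongly classified as benign (false negatives). In addition, only one benign case was classified as malignant (a false positive).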

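Let's try to minimize the false negatives by using grid search to find the best hyperparameters. Grid search can be used to improve any chosen evaluation metric.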

The metric to focus on for reducing false negatives is recall.

6. Grid search to maximize recall

from sklearn.model_selection import GridSearchCV
#liblinear supports both the l1 and l2 penalties (the lbfgs solver, the default in newer scikit-learn versions, does not support l1)
clf = LogisticRegression(solver='liblinear')
grid_values = {'penalty': ['l1', 'l2'],'C':[0.001,.009,0.01,.09,1,5,10,25]}
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall')
grid_clf_acc.fit(X_train, y_train)
print(grid_clf_acc.best_estimator_)
y_pred_acc = grid_clf_acc.predict(X_test)
# Model Evaluation metrics 
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))
#Grid search (logistic regression) confusion matrix
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred_acc)))


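The hyperparameters we tuned are:

penalty: the norm used in the penalization, either l1 or l2.
C: the inverse of the regularization strength; smaller values of C specify stronger regularization.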
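To illustrate what "stronger regularization" means in practice, the sketch below fits logistic regression with a few values of C and prints the size (L2 norm) of the learned coefficients. The synthetic data and the names X_demo, y_demo and norms are assumptions for illustration; the exact numbers will differ on the Breast Cancer data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

norms = {}
for C in [0.001, 1, 100]:
    # Default l2 penalty; max_iter raised so the solver has room to converge
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(C, round(norms[C], 3))
```

With a very small C the coefficients are shrunk hard toward zero, while a large C leaves them close to the unregularized fit.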

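In addition, the grid search function has a scoring parameter, where we can specify the metric used to evaluate the models (we chose recall). From the confusion matrix below, we can see that the number of false negatives has gone down, but at the cost of more false positives. After grid search, recall rose from 88.2% to 91.2%, while precision dropped from 98.3% to 87.3%. The Python code for plotting the confusion matrix is as follows: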

#Function to plot intuitive confusion matrix
import itertools
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred_acc)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plt.figure()
class_names = [0,1]
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix - Logistic Regression + Grid Search')
a = plt.gcf()
a.set_size_inches(8,4)
plt.show()

Output:

Confusion matrix, without normalization
[[94  9]
 [ 6 62]]

You can tune the model further to strike a balance between precision and recall by using the "f1" score as the evaluation metric.
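One way this could look is sketched below: GridSearchCV accepts a list of metrics in scoring, and refit names the metric used to pick the final model. The synthetic data, the smaller grid, and cv=5 are assumptions for illustration rather than the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.65, 0.35], random_state=0)

grid = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10]}

# Track precision, recall and F1 for every candidate, but select the model by F1
search = GridSearchCV(LogisticRegression(solver='liblinear'),
                      param_grid=grid,
                      scoring=['precision', 'recall', 'f1'],
                      refit='f1',
                      cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
i = search.best_index_
for metric in ['precision', 'recall', 'f1']:
    print(metric, round(search.cv_results_['mean_test_' + metric][i], 3))
```

Looking at all three cross-validated scores for the selected candidate shows what was traded away to reach the best F1.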

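Grid search builds a model for every specified combination of hyperparameters and evaluates each one. A more efficient technique for hyperparameter tuning is randomized search, which uses random combinations of hyperparameters to find the best solution.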
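The difference can be sketched with scikit-learn's ParameterGrid and RandomizedSearchCV. The grid is the one used in step 6; the synthetic data, the log-uniform range for C, and n_iter=6 are assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Grid search evaluates every combination: 2 penalties x 8 values of C
grid_values = {'penalty': ['l1', 'l2'],
               'C': [0.001, .009, 0.01, .09, 1, 5, 10, 25]}
print(len(ParameterGrid(grid_values)))  # 16

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Randomized search samples a fixed number of candidates instead;
# here C is drawn from a log-uniform distribution over the same range
distributions = {'penalty': ['l1', 'l2'], 'C': loguniform(1e-3, 25)}
search = RandomizedSearchCV(LogisticRegression(solver='liblinear'),
                            param_distributions=distributions,
                            n_iter=6, scoring='recall', cv=5, random_state=0)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Here only 6 candidates are fitted instead of 16. The saving grows quickly as more hyperparameters are added, although a random sample is not guaranteed to include the best combination.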
