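Hyperparameter Tuning Explained: Tuning Phases, Tuning Methods, Bayesian Optimization, and Sample Code
Hyperparameters are an important part of a machine learning model. In this article, we will discuss:
- the tuning phases in machine learning modeling,
- the important parameters of machine learning models (GBDT models in particular),
- four common tuning methods (manual tuning, grid search, random search, Bayesian optimization).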
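1. General hyperparameter tuning strategy
1.1 Three phases of parameter tuning alongside feature engineering
Keep the following common steps in mind:
- Initial phase: start with baseline parameters and baseline feature engineering.
- Warm-up phase: tune a few important parameters manually or by grid search over a handful of candidate values.
- Tuning phase: run random search or Bayesian optimization over more parameters, and finalize feature engineering.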
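1.2 What are the hyperparameter baselines, and which parameters are worth tuning?
You will then run into another question: "What are the hyperparameter baselines, and which parameters are worth tuning?"
Every machine learning model has different parameters, so I cannot cover each model here. Keeping an eye on the choice of parameters is always part of the data scientist's job.
In this article I will focus on the GBDT models xgboost, lightgbm, and catboost, which are the go-to models for this discussion.
The table below gives a summary:
Table: List of important hyperparameters for the three GBDT models, with their baseline choices and tuning ranges.
People who model GBDT with the Python packages usually choose either the original function version (the 'original API') or the sklearn API. Most of the time you can pick whichever you prefer, but keep in mind that, except in the catboost package, the original API and the sklearn API may use different names for the same parameter.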
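To illustrate the naming difference, here is a small sketch that translates a few sklearn-API parameter names for lightgbm into the primary names used by its original API. The pairs listed are aliases documented by lightgbm; the helper `to_native_params` is a hypothetical name introduced only for this illustration, and the mapping is deliberately incomplete.

```python
# A few lightgbm parameters whose sklearn-API name differs from the
# primary name in the original (native) API. Not an exhaustive list.
SKLEARN_TO_NATIVE = {
    "n_estimators": "num_iterations",
    "subsample": "bagging_fraction",
    "subsample_freq": "bagging_freq",
    "colsample_bytree": "feature_fraction",
    "min_child_weight": "min_sum_hessian_in_leaf",
    "min_child_samples": "min_data_in_leaf",
    "reg_alpha": "lambda_l1",
    "reg_lambda": "lambda_l2",
}


def to_native_params(sklearn_params):
    """Rename sklearn-style keys to native names (hypothetical helper).

    Keys that are not in the mapping, such as max_depth, pass through unchanged.
    """
    return {SKLEARN_TO_NATIVE.get(key, key): value
            for key, value in sklearn_params.items()}


native = to_native_params(
    {"max_depth": 7, "subsample": 0.8, "colsample_bytree": 0.6}
)
print(native)
```

Keeping such a mapping next to your tuning code makes it easier to reuse a parameter set found with one API in the other.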
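2. Four basic methods of hyperparameter tuning
#1 Manual tuning
In manual tuning, you look at the current parameter choice and its score, change some of the parameters, retrain the model, and check how the score changes. Nothing updates the parameter values automatically during this process.
The advantage of manual tuning is:
- You learn how the hyperparameters behave and can carry that knowledge over to other projects. I therefore recommend tuning the major models manually at least once.
The disadvantages are:
- It requires manual work.
- You may read too much into an accidental change in the score, without trying to check whether the change generalizes.
Examples of manual tuning:
- When you find that too many useless variables are fed into the model, increase the weight of the regularization parameter.
- When you think the model is missing interactions between variables, increase the number of splits (in the GBDT case).
You might ask: if manual tuning is far from the best way to reach the globally best parameters, why do it at all? In practice, it is useful in the early phase to get a feel for how sensitive the model is to hyperparameter changes, and again in the final phase for fine-tuning.
Surprisingly, many top practitioners prefer manual tuning to grid search or random search.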
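A minimal sketch of one manual tuning step follows. It is an illustration under assumptions, not the workflow of any particular practitioner: the data is synthetic, and sklearn's GradientBoostingRegressor stands in for the GBDT packages so that the example runs without extra dependencies. It changes a single parameter by hand and repeats the comparison on several fold splits, to check whether the score change generalizes rather than being an accident of one split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only so that the sketch is self-contained.
X, y = make_regression(n_samples=300, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)


def mean_r2(params, seed):
    """Average out-of-fold R^2 for one parameter set on one fold split."""
    model = GradientBoostingRegressor(n_estimators=50, random_state=0, **params)
    cv = KFold(n_splits=3, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="r2").mean()


baseline = {"max_depth": 3}
candidate = {"max_depth": 5}  # the single change being tried by hand

# Repeat the comparison on several fold splits, so that one lucky split
# is not mistaken for a real improvement.
diffs = [mean_r2(candidate, seed) - mean_r2(baseline, seed) for seed in range(3)]
print("score change per split:", np.round(diffs, 4))
print("candidate better on every split:", all(d > 0 for d in diffs))
```

If the sign of the change flips from split to split, the difference is probably noise and not worth keeping.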
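#2 Grid search
Grid search starts from a prepared set of candidate hyperparameter sets, trains a model for each candidate set, and selects the best-performing one.
Setting the parameters and evaluating them is usually automated by a supporting library such as GridSearchCV from sklearn.model_selection.
The advantage of this method is:
- You cover every candidate parameter set you expect to matter.
The disadvantage is:
- A single run for one hyperparameter set already takes some time, so running the whole grid can take very long. This puts a practical limit on the number of parameters you can explore.
Python code example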
```python
# lightgbm sklearn API version.
from lightgbm import LGBMRegressor, early_stopping
# Importing GridSearchCV.
from sklearn.model_selection import train_test_split, GridSearchCV

import pandas as pd

# Import a dataset and prepare train/test data for the sklearn functions.
df = pd.read_csv('data.csv', index_col=0)
y = df['Value']
X = df.drop(['Value'], axis=1)
X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

# Proportion of the training set held out as validation set for early stopping.
r = 0.1
trainLen = round(len(X_train0) * (1 - r))

# Split the training data into a training set and an early-stopping validation set.
X_train = X_train0.iloc[:trainLen, :]
y_train = y_train0.iloc[:trainLen]
X_val = X_train0.iloc[trainLen:, :]
y_val = y_train0.iloc[trainLen:]

# Define the parameter space for grid search.
gridParams = {
    'max_depth': [3, 5, 7, 9],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'min_child_weight': [0.1, 1.0, 2.0],
}

# Define lightgbm and grid search.
# subsample_freq=1 is set because lightgbm ignores `subsample` while subsample_freq is 0.
reg = LGBMRegressor(learning_rate=0.1, n_estimators=1000, subsample_freq=1, random_state=1000)
reg_gridsearch = GridSearchCV(reg, gridParams, cv=5, scoring='r2', n_jobs=-1)

# Model fit with early stopping. Recent lightgbm versions take early stopping as a
# callback; older versions used fit(..., early_stopping_rounds=100) instead.
reg_gridsearch.fit(X_train, y_train,
                   eval_set=[(X_val, y_val)],
                   callbacks=[early_stopping(stopping_rounds=100)])
## Final l2 was l2: 0.0203797.

# Confirm which parameters were selected.
reg_gridsearch.best_params_
## {'colsample_bytree': 0.6,
##  'max_depth': 9,
##  'min_child_weight': 0.1,
##  'subsample': 0.6}
```
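To make the running-time concern concrete, the sketch below counts the candidates in the same four-parameter grid as the example above and multiplies by the number of CV folds. It only does the arithmetic and trains nothing; the variable names are chosen for this illustration.

```python
from itertools import product

# Same grid as in the grid search example above.
grid = {
    "max_depth": [3, 5, 7, 9],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "min_child_weight": [0.1, 1.0, 2.0],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate set.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "model fits")
# prints: 108 candidates, 540 model fits
```

Adding a fifth parameter with three values would triple both numbers, which is why grid search is usually limited to a few parameters.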
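#3 Random search
Random search, like grid search, evaluates candidate hyperparameter sets, but the candidates are drawn at random from a prepared search space. Random selection, model training, and evaluation are repeated for the number of search iterations you specify, and the best-performing hyperparameter set is selected at the end.
We can control the randomness by assigning each parameter a density function, such as a uniform or normal distribution, instead of a list of specific values.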
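As a small illustration of assigning distributions to parameters, the sketch below uses scipy.stats. Two details are easy to get wrong: `stats.uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, not from the first argument to the second, and `stats.randint(low, high)` excludes `high`. The `learning_rate` entry with a log-uniform distribution is an assumption added for illustration and is not part of this article's search spaces.

```python
import scipy.stats as stats

space = {
    "max_depth": stats.randint(3, 13),              # integers 3..12, upper bound excluded
    "subsample": stats.uniform(0.6, 0.4),           # loc=0.6, scale=0.4 -> values in [0.6, 1.0]
    "learning_rate": stats.loguniform(1e-3, 1e-1),  # illustrative: uniform on a log scale
}

# Draw 1000 values per parameter and check the range that was actually covered.
samples = {
    name: dist.rvs(size=1000, random_state=seed)
    for seed, (name, dist) in enumerate(space.items())
}
for name, values in samples.items():
    print(name, "min:", values.min(), "max:", values.max())
```

A dictionary like `space` can be passed as the parameter distributions of RandomizedSearchCV, which calls `rvs` on each entry.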
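Setting the parameters and evaluating them is usually automated by a supporting library such as RandomizedSearchCV from sklearn.model_selection.
The advantage of random search is:
- You do not have to worry about the running time, because you control the number of parameter searches.
The disadvantages are:
- The hyperparameter set finally selected may not be the best one within the search range.
- Depending on the number of searches and the size of the parameter space, some parameters may not be explored sufficiently.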
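A back-of-the-envelope calculation shows how the number of searches relates to the first disadvantage. The sketch assumes that candidates are drawn independently and uniformly over the search space, and that "good" means landing in the best 5% of that space by volume; both are simplifying assumptions made for this illustration.

```python
def prob_hit_top_fraction(n_iter, top_fraction=0.05):
    """Probability that at least one of n_iter independent uniform draws
    lands in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_iter


for n_iter in (10, 20, 60):
    print(n_iter, "draws:", round(prob_hit_top_fraction(n_iter), 3))
```

Under these assumptions, 20 draws reach the top 5% region only about 64% of the time, while 60 draws do so more than 95% of the time.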
Python example
```python
# lightgbm sklearn API version.
from lightgbm import LGBMRegressor, early_stopping
# Importing RandomizedSearchCV.
from sklearn.model_selection import train_test_split, RandomizedSearchCV
# Used to declare the distributions of the parameters.
import scipy.stats as stats

import pandas as pd

# Import a dataset and prepare train/test data for the sklearn functions.
df = pd.read_csv('data.csv', index_col=0)
y = df['Value']
X = df.drop(['Value'], axis=1)
X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

# Proportion of the training set held out as validation set for early stopping.
r = 0.1
trainLen = round(len(X_train0) * (1 - r))

# Split the training data into a training set and an early-stopping validation set.
X_train = X_train0.iloc[:trainLen, :]
y_train = y_train0.iloc[:trainLen]
X_val = X_train0.iloc[trainLen:, :]
y_val = y_train0.iloc[trainLen:]

# Define the parameter space for random search.
randParams = {
    'max_depth': stats.randint(3, 13),                    # integer between 3 and 12
    'subsample': stats.uniform(0.6, 1.0 - 0.6),           # value between 0.6 and 1.0
    'colsample_bytree': stats.uniform(0.6, 1.0 - 0.6),    # value between 0.6 and 1.0
    'min_child_weight': stats.uniform(0.1, 10.0 - 0.1),   # value between 0.1 and 10.0
}

# Define lightgbm and random search. Note that n_iter and random_state are added
# to the search CV arguments.
# subsample_freq=1 is set because lightgbm ignores `subsample` while subsample_freq is 0.
reg = LGBMRegressor(learning_rate=0.1, n_estimators=1000, subsample_freq=1, random_state=1000)
reg_randsearch = RandomizedSearchCV(reg, randParams, cv=5, n_iter=20, scoring='r2',
                                    n_jobs=-1, random_state=2222)

# Model fit with early stopping. Recent lightgbm versions take early stopping as a
# callback; older versions used fit(..., early_stopping_rounds=100) instead.
reg_randsearch.fit(X_train, y_train,
                   eval_set=[(X_val, y_val)],
                   callbacks=[early_stopping(stopping_rounds=100)])
## Final l2 was l2: 0.0212662.

# Confirm which parameters were selected.
reg_randsearch.best_params_
## {'colsample_bytree': 0.6101850277033293,
##  'max_depth': 7,
##  'min_child_weight': 8.263738852474235,
##  'subsample': 0.9167268345677564}
```
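#4 Bayesian optimization
Bayesian optimization starts from random trials and then narrows down the search space using a Bayesian approach.
If you know Bayes' theorem, you can think of it this way: it starts with random search and uses the results to update a prior distribution of belief about promising hyperparameters into a posterior distribution.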
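The idea can be sketched on a toy problem. The example below is a simplified illustration under stated assumptions, not the algorithm used by hyperopt (which relies on TPE): it swaps in a Gaussian-process surrogate with a lower-confidence-bound rule, tunes a single made-up "hyperparameter" x, and uses a made-up loss function. All names in it are hypothetical. It starts from three random trials, updates its belief about the loss after each evaluation, and picks the next trial where that belief looks most promising.

```python
import numpy as np


def objective(x):
    """Toy 'validation loss' with its minimum at x = 2 (made up for this sketch)."""
    return (x - 2.0) ** 2


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, jitter=1e-4):
    """Posterior mean and std of a zero-mean, unit-variance GP at x_new."""
    K = rbf(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_new)  # shape (n_obs, n_new)
    mean = K_s.T @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


rng = np.random.default_rng(0)
grid = np.linspace(-5.0, 5.0, 201)  # candidate values of the hyperparameter
tried = [int(i) for i in rng.choice(len(grid), size=3, replace=False)]  # random start
losses = [objective(grid[i]) for i in tried]

for _ in range(12):
    x_obs, y_obs = grid[tried], np.array(losses)
    # Standardize the losses so that they match the scale of the GP prior.
    y_std = (y_obs - y_obs.mean()) / (y_obs.std() + 1e-12)
    mean, std = gp_posterior(x_obs, y_std, grid)
    acquisition = mean - 2.0 * std   # low predicted loss or high uncertainty
    acquisition[tried] = np.inf      # never re-evaluate a point
    next_index = int(np.argmin(acquisition))
    tried.append(next_index)
    losses.append(objective(grid[next_index]))

best = int(np.argmin(losses))
print("best x:", grid[tried[best]], "loss:", losses[best])
```

The acquisition rule is where exploration and exploitation are traded off: a larger multiplier on `std` explores more, while a smaller one trusts the current belief more and is more likely to stay near a local optimum.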
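The advantage of Bayesian optimization is:
- The search can be efficient (but is not guaranteed to be).
The disadvantage is:
- It may get stuck in a local optimum.
Two common Python libraries for Bayesian optimization are hyperopt and optuna. There are others as well, such as gpyopt, spearmint, and scikit-optimize.
Below is Python sample code using hyperopt.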
```python
# Import hyperopt-related methods.
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
# lightgbm sklearn API version.
from lightgbm import LGBMRegressor, early_stopping
# Score used in the optimization.
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, KFold

import pandas as pd
import numpy as np

# Import a dataset and prepare train/test data for the sklearn functions.
df = pd.read_csv('data.csv', index_col=0)
y = df['Value']
X = df.drop(['Value'], axis=1)
X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

# Proportion of the training set held out as validation set for early stopping.
r = 0.1
trainLen = round(len(X_train0) * (1 - r))

# Split the training data into a training set and an early-stopping validation set.
X_train = X_train0.iloc[:trainLen, :].reset_index(drop=True)
y_train = y_train0.iloc[:trainLen].reset_index(drop=True)
X_val = X_train0.iloc[trainLen:, :]
y_val = y_train0.iloc[trainLen:]

# Prepare the CV folds for cross-validation.
kf = KFold(n_splits=5, shuffle=True, random_state=3333)

# Collects (params, loss) for every evaluation made by hyperopt.
history = []


# Define the score function to be minimized in Bayesian optimization.
# Here I chose the average r2 score over the validation folds, but the choice
# should depend on the purpose of your modeling.
def score(params):
    # subsample_freq=1 is set because lightgbm ignores `subsample` while subsample_freq is 0.
    reg = LGBMRegressor(learning_rate=0.1, n_estimators=1000, subsample_freq=1,
                        random_state=1000, **params)
    r2_res = []
    for train_index, val_index in kf.split(X_train):
        X_train_kf = X_train.iloc[train_index, :]
        X_val_kf = X_train.iloc[val_index, :]
        y_train_kf = y_train.iloc[train_index]
        y_val_kf = y_train.iloc[val_index]
        # Recent lightgbm versions take early stopping as a callback; older
        # versions used fit(..., early_stopping_rounds=100, verbose=False).
        reg.fit(X_train_kf, y_train_kf,
                eval_set=[(X_val, y_val)],
                callbacks=[early_stopping(stopping_rounds=100, verbose=False)])
        r2_res.append(r2_score(y_val_kf, reg.predict(X_val_kf)))
    # hyperopt solves a minimization problem, so a higher-is-better score
    # such as r2 has to be negated.
    loss = -np.mean(r2_res)
    history.append((params, loss))
    return {'loss': loss, 'status': STATUS_OK}


# Define the parameter space. See the hyperopt web page for the function definitions.
# http://hyperopt.github.io/hyperopt/getting-started/search_spaces/#parameter-expressions
space = {
    # hp.randint('max_depth', 13) draws an integer in 0..12, so max_depth ranges over 3..15.
    'max_depth': 3 + hp.randint('max_depth', 13),
    'subsample': hp.uniform('subsample', 0.6, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.6, 1.0),
    'min_child_weight': hp.uniform('min_child_weight', 0.1, 10.0),
}

# Execute Bayesian optimization.
max_evals = 20
trials = Trials()
fmin(score, space, algo=tpe.suggest, trials=trials, max_evals=max_evals)

# Output the best parameters and score.
history = sorted(history, key=lambda tpl: tpl[1])
best = history[0]
print(f'Best params: {best[0]}, score: {best[1]:.4f}')
# Best params: {'colsample_bytree': 0.8696055514792674, 'max_depth': 9,
# 'min_child_weight': 7.079903514946092, 'subsample': 0.852555363495354},
# score: -0.9369
```
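3. Hyperparameter tuning and K-folding in cross-validation
In all of the hyperparameter tuning methods discussed above, to avoid overfitting it is important to first split the data into K folds, and then repeat training on the training folds and validation on the out-of-fold data.
Moreover, if you keep using the same fold split in the later cross-validation (the one used to compare models), your model with the chosen hyperparameters may already be overfitted to those folds, and you would have no chance to notice.
It is therefore important to use different fold splits for hyperparameter tuning and for the later cross-validation, which you can do by changing the random seed.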
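A minimal sketch of this seed change, using sklearn's KFold on placeholder data: the two splitter objects differ only in `random_state`, yet they assign rows to folds differently. The variable names and the second seed are chosen for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # placeholder data: only the row count matters

# Folds used while tuning hyperparameters.
tuning_cv = KFold(n_splits=5, shuffle=True, random_state=3333)
# Folds used afterwards to compare the tuned models: same setup, different seed.
comparison_cv = KFold(n_splits=5, shuffle=True, random_state=4444)

tuning_folds = [val.tolist() for _, val in tuning_cv.split(X)]
comparison_folds = [val.tolist() for _, val in comparison_cv.split(X)]

print("first tuning fold:    ", tuning_folds[0])
print("first comparison fold:", comparison_folds[0])
```

Pass `tuning_cv` as the `cv` argument of the search object and `comparison_cv` to the later evaluation, so that the tuned model is judged on splits it has not been selected on.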
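Another approach is nested cross-validation, which has two levels of cross-validation loops: an outer one and an inner one.
A big drawback of nested cross-validation is the running time: every outer fold runs a full inner cross-validated search, so the number of model fits grows with the number of inner-loop folds.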
```python
# A simpler model is chosen since this is a demonstration of nested CV.
from sklearn.linear_model import Lasso
# KFold and cross_validate do the nested CV.
from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold, cross_validate
# Used to declare the distributions of the parameters.
import scipy.stats as stats

import pandas as pd

# Import a dataset and prepare train/test data for the sklearn functions.
df = pd.read_csv('data.csv', index_col=0)
y = df['Value']
X = df.drop(['Value'], axis=1)
X_train0, X_test, y_train0, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)

reg = Lasso()

# Only one hyperparameter in LASSO.
lassoParam = {
    'alpha': stats.uniform(0.0001, 0.01),
}

# Prepare two KFolds, one for the outer loop and one for the inner loop.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=3333)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=3335)

# This chooses the best hyperparameter.
nestedcv_inner = RandomizedSearchCV(reg, lassoParam, cv=inner_cv, n_iter=20, scoring='r2',
                                    n_jobs=-1, random_state=4444, refit=True)

# This gives the generalization error of LASSO with the hyperparameter chosen in the inner loop.
# The training split (X_train0, y_train0) is used; the test split stays untouched.
nestedcv_outer = cross_validate(nestedcv_inner, X_train0, y_train0, scoring='r2', cv=outer_cv,
                                n_jobs=-1, return_estimator=True)

# Chosen hyperparameter in each inner CV.
print([nestedcv_outer['estimator'][i].best_params_ for i in range(5)])
## [{'alpha': 0.0009295615415650937}, {'alpha': 0.0009295615415650937}, {'alpha': 0.0009295615415650937},
##  {'alpha': 0.0009295615415650937}, {'alpha': 0.0009295615415650937}]
## * Seeing the same value everywhere may look strange, but it is not wrong.
##   This time the data was 'too easy' and a smaller alpha was always better.
##   Because the random_state of the search is fixed, every outer loop walked through
##   the same parameter candidates and ended up finding the same best parameter.

# Outer loop CV scores.
print(nestedcv_outer['test_score'])
## [0.77166781 0.75451344 0.76503072 0.75422108 0.74384193]
```
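To put a number on the running-time drawback, the sketch below counts the model fits implied by the settings of the nested cross-validation example above (5 outer folds, 5 inner folds, 20 random candidates, and one refit per search). It only does the arithmetic; the variable names are chosen for this illustration.

```python
outer_folds = 5
inner_folds = 5
n_iter = 20  # random candidates per search

# One randomized search: every candidate is fitted on every inner fold,
# plus one refit of the best candidate on all data given to the search.
fits_per_search = n_iter * inner_folds + 1

# Nested CV repeats the whole search once per outer fold.
nested_fits = outer_folds * fits_per_search

print(fits_per_search, "fits for a single search,", nested_fits, "fits for nested CV")
# prints: 101 fits for a single search, 505 fits for nested CV
```

With a cheap model such as LASSO this is affordable, while with a GBDT model trained for 1000 rounds the same count can become a serious cost.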
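Conclusion
The approach we take to hyperparameter tuning changes as modeling progresses: start with a small number of parameters using manual tuning or grid search, and as the model improves, look at more parameters using random search or Bayesian optimization. There is no fixed rule, though.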