Hyperparameter Tuning in XGBoost Using a Genetic Algorithm
Introduction
Genetic algorithms are inspired by the process of natural selection described by Charles Darwin. The following walkthrough explains the natural process and how it maps onto a genetic algorithm:
We start with an initial population that has certain characteristics, as shown in Figure 1. This initial population is tested in a particular environment to see how well the individuals (parents) in it perform, based on a predefined fitness criterion. In machine learning, the fitness can be any performance metric: accuracy, precision, recall, F1 score, and so on. Based on the fitness values, we select the best-performing parents ("survival of the fittest") as the surviving population (Figure 2).
The parents in the surviving population now mate to produce offspring, using a combination of two steps: crossover/recombination and mutation. In crossover, the genes (parameters) of the mating parents are recombined to produce offspring, with each child inheriting some genes (parameters) from each parent (Figure 3).
Finally, in mutation, the values of some genes (parameters) are altered to maintain genetic diversity (Figure 4); this is what often allows nature, and genetic algorithms, to arrive at better solutions.
Figure 5 shows the second generation of the population, which includes both the surviving parents and their children. We keep the surviving parents so that the best set of parameters is preserved in case the offspring turn out to have worse fitness values than their parents.
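Before tailoring this workflow to XGBoost, here is a minimal sketch of the generic loop described above. The helper names (evaluate_fitness, select_parents, crossover, mutate) are placeholders for illustration only; the XGBoost-specific versions are built in the sections that follow.

import numpy as np

def run_genetic_algorithm(population, n_generations, n_survivors,
                          evaluate_fitness, select_parents, crossover, mutate):
    # Generic GA loop: evaluate -> select -> crossover -> mutate -> rebuild population
    for _ in range(n_generations):
        fitness = evaluate_fitness(population)                      # e.g. F1 score of each individual
        parents = select_parents(population, fitness, n_survivors)  # survival of the fittest
        n_children = population.shape[0] - parents.shape[0]
        children = mutate(crossover(parents, n_children))           # recombine genes, then perturb them
        population = np.vstack([parents, children])                 # keep the best parents plus the new children
    return population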
A Genetic Algorithm Module for XGBoost:
We will create a genetic algorithm module tailored to XGBoost, which is described as follows:
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.
The module will have functions covering the following four steps: (i) initialization, (ii) selection, (iii) crossover, and (iv) mutation, mirroring the process discussed above.
Initialization:
The first step is initialization, in which the parameters are initialized randomly to create the population. It corresponds to the first generation of the population shown in Figure 1. The Python code below shows the initialization process, in which we generate a vector of parameters for each parent. For XGBoost we have chosen seven parameters to optimize: learning_rate, n_estimators, max_depth, min_child_weight, subsample, colsample_bytree, and gamma. A detailed description of these parameters can be found in the XGBoost documentation.
import random
import numpy as np

def initilialize_poplulation(numberOfParents):
    # One column per hyperparameter; each row will become one individual (parent)
    learningRate = np.empty([numberOfParents, 1])
    nEstimators = np.empty([numberOfParents, 1], dtype=np.int32)  # int32: uint8 would overflow for values > 255
    maxDepth = np.empty([numberOfParents, 1], dtype=np.uint8)
    minChildWeight = np.empty([numberOfParents, 1])
    gammaValue = np.empty([numberOfParents, 1])
    subSample = np.empty([numberOfParents, 1])
    colSampleByTree = np.empty([numberOfParents, 1])

    for i in range(numberOfParents):
        learningRate[i] = round(random.uniform(0.01, 1), 2)
        nEstimators[i] = random.randrange(10, 1500, step=25)
        maxDepth[i] = int(random.randrange(1, 10, step=1))
        minChildWeight[i] = round(random.uniform(0.01, 10.0), 2)
        gammaValue[i] = round(random.uniform(0.01, 10.0), 2)
        subSample[i] = round(random.uniform(0.01, 1.0), 2)
        colSampleByTree[i] = round(random.uniform(0.01, 1.0), 2)

    # Stack the columns so that each row holds the full parameter vector of one parent
    population = np.concatenate((learningRate, nEstimators, maxDepth, minChildWeight,
                                 gammaValue, subSample, colSampleByTree), axis=1)
    return population
The limits on each parameter are based either on the ranges described in the XGBoost documentation or on a reasonable guess where the documented upper limit is infinity. We first create an empty array for each parameter and then fill it with random values.
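For instance, the initialization step can be checked on its own; a small sketch (the parent count of 8 is just an illustrative choice, matching the run later in this article):

population = initilialize_poplulation(8)  # 8 parents, each with 7 hyperparameters
print(population.shape)                   # expected: (8, 7)
print(population[0])                      # one randomly initialized parameter vector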
Parent Selection
In the second step, we train our model on the initial population and compute the fitness values. In this case we compute the F1 score. The Python code is as follows:
from sklearn.metrics import f1_score
import xgboost as xgb

def fitness_f1score(y_true, y_pred):
    # Weighted F1 score is used as the fitness value
    fitness = round(f1_score(y_true, y_pred, average='weighted'), 4)
    return fitness

# Train the data and find the fitness score of each individual
def train_population(population, dMatrixTrain, dMatrixtest, y_test):
    fScore = []
    for i in range(population.shape[0]):
        param = {'objective': 'binary:logistic',
                 'learning_rate': population[i][0],
                 'n_estimators': population[i][1],
                 'max_depth': int(population[i][2]),
                 'min_child_weight': population[i][3],
                 'gamma': population[i][4],
                 'subsample': population[i][5],
                 'colsample_bytree': population[i][6],
                 'seed': 24}
        num_round = 100
        xgbT = xgb.train(param, dMatrixTrain, num_round)
        preds = xgbT.predict(dMatrixtest)
        preds = preds > 0.5
        fScore.append(fitness_f1score(y_test, preds))
    return fScore
We then define how many parents we want to select based on their fitness values and build an array of the selected parents. The Python implementation is as follows:
# Select parents for mating
def new_parents_selection(population, fitness, numParents):
    selectedParents = np.empty((numParents, population.shape[1]))  # array to store the fittest parents

    # Find the top best-performing parents
    for parentId in range(numParents):
        bestFitnessId = np.where(fitness == np.max(fitness))
        bestFitnessId = bestFitnessId[0][0]
        selectedParents[parentId, :] = population[bestFitnessId, :]
        fitness[bestFitnessId] = -1  # set to a negative value (F1 score is non-negative) so this parent is not selected again
    return selectedParents
Crossover
There are several ways to define crossover in a genetic algorithm, such as single-point, two-point and k-point crossover, uniform crossover, and crossover for ordered lists. We will use uniform crossover, in which each parameter of the child is chosen independently from one of the parents according to a particular distribution. In our case, we use the "discrete uniform" distribution from NumPy's random functions. The Python code is as follows:
'''
Mate these parents to create children having parameters from these parents
(we are using the uniform crossover method)
'''
def crossover_uniform(parents, childrenSize):
    crossoverPointIndex = np.arange(0, np.uint8(childrenSize[1]), 1, dtype=np.uint8)  # get all the indexes
    crossoverPointIndex1 = np.random.randint(0, np.uint8(childrenSize[1]),
                                             np.uint8(childrenSize[1] / 2))  # select half of the indexes randomly
    crossoverPointIndex2 = np.array(list(set(crossoverPointIndex) - set(crossoverPointIndex1)))  # select the leftover indexes

    children = np.empty(childrenSize)

    '''
    Create each child by choosing parameters from the two parents selected with the
    new_parents_selection function. The parameter values are picked from the indexes
    that were randomly selected above.
    '''
    for i in range(childrenSize[0]):
        # find parent 1 index
        parent1_index = i % parents.shape[0]
        # find parent 2 index
        parent2_index = (i + 1) % parents.shape[0]
        # insert parameters from parent 1 at the randomly selected indexes
        children[i, crossoverPointIndex1] = parents[parent1_index, crossoverPointIndex1]
        # insert parameters from parent 2 at the remaining indexes
        children[i, crossoverPointIndex2] = parents[parent2_index, crossoverPointIndex2]
    return children
Mutation
The final step is to introduce diversity into the children by randomly selecting one parameter and changing its value by a random amount. We also introduce limits so that the altered value stays within a predefined range; skipping these constraints may lead to errors.
def mutation(crossover, numberOfParameters):
    # Define the minimum and maximum values allowed for each parameter
    minMaxValue = np.zeros((numberOfParameters, 2))
    minMaxValue[0, :] = [0.01, 1.0]   # min/max learning_rate
    minMaxValue[1, :] = [10, 2000]    # min/max n_estimators
    minMaxValue[2, :] = [1, 15]       # min/max max_depth
    minMaxValue[3, :] = [0, 10.0]     # min/max min_child_weight
    minMaxValue[4, :] = [0.01, 10.0]  # min/max gamma
    minMaxValue[5, :] = [0.01, 1.0]   # min/max subsample
    minMaxValue[6, :] = [0.01, 1.0]   # min/max colsample_bytree

    # Mutation changes a single gene in each offspring randomly
    mutationValue = 0
    parameterSelect = np.random.randint(0, 7)  # index of the parameter to mutate
    if parameterSelect == 0:  # learning_rate
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
    if parameterSelect == 1:  # n_estimators
        mutationValue = np.random.randint(-200, 200)
    if parameterSelect == 2:  # max_depth
        mutationValue = np.random.randint(-5, 5)
    if parameterSelect == 3:  # min_child_weight
        mutationValue = round(np.random.uniform(-5, 5), 2)
    if parameterSelect == 4:  # gamma
        mutationValue = round(np.random.uniform(-2, 2), 2)
    if parameterSelect == 5:  # subsample
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
    if parameterSelect == 6:  # colsample_bytree
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)

    # Introduce mutation by changing one parameter, clamping to the min or max if it goes out of range
    for idx in range(crossover.shape[0]):
        crossover[idx, parameterSelect] = crossover[idx, parameterSelect] + mutationValue
        if crossover[idx, parameterSelect] > minMaxValue[parameterSelect, 1]:
            crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 1]
        if crossover[idx, parameterSelect] < minMaxValue[parameterSelect, 0]:
            crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 0]
    return crossover
Implementation
We will now use the module discussed above to train on a dataset. The dataset comes from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/musk/). It contains a set of 102 molecules, of which 39 were identified by humans as having an odor usable in perfumery and 63 as not having the desired odor. The dataset contains 6,590 low-energy conformations of these molecules, each described by 166 features. Since the goal of this tutorial is to understand the genetic algorithm, we do only minimal preprocessing. The Python code is as follows:
# Importing the libraries
import numpy as np
import pandas as pd
import geneticXGboost  # this is the module we created above
import xgboost as xgb

np.random.seed(723)

# Importing the dataset
dataset = pd.read_csv('clean2.data', header=None)
X = dataset.iloc[:, 2:168].values  # discard the first two columns: the molecule's name and the conformation's name
y = dataset.iloc[:, 168].values    # extract the last column as the class (1 => desired odor, 0 => undesired odor)

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=97)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# XGBoost classifier: use the native xgboost API from here on
xgDMatrix = xgb.DMatrix(X_train, y_train)      # create the training DMatrix
xgbDMatrixTest = xgb.DMatrix(X_test, y_test)   # create the test DMatrix
We start with 8 parents and select the 4 fittest parents for mating. We create 4 generations and monitor the fitness (F1 score). In each new generation, half of the population consists of the fittest parents from the previous generation. This guarantees that the best fitness score is at least as good as in the previous generation, even if all the children score worse.
numberOfParents = 8          # number of parents to start with
numberOfParentsMating = 4    # number of parents that will mate
numberOfParameters = 7       # number of parameters that will be optimized
numberOfGenerations = 4      # number of generations that will be created

# define the population size
populationSize = (numberOfParents, numberOfParameters)

# initialize the population with randomly generated parameters
population = geneticXGboost.initilialize_poplulation(numberOfParents)

# define an array to store the fitness history
fitnessHistory = np.empty([numberOfGenerations + 1, numberOfParents])

# define an array to store the value of each parameter for each parent and generation
populationHistory = np.empty([(numberOfGenerations + 1) * numberOfParents, numberOfParameters])

# insert the values of the initial parameters into the history
populationHistory[0:numberOfParents, :] = population

for generation in range(numberOfGenerations):
    print("This is generation number %s" % (generation))

    # train the dataset and obtain the fitness values
    fitnessValue = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix,
                                                   dMatrixtest=xgbDMatrixTest, y_test=y_test)
    fitnessHistory[generation, :] = fitnessValue

    # best score in the current iteration
    print('Best F1 score in this iteration = {}'.format(np.max(fitnessHistory[generation, :])))

    # survival of the fittest - take the top parents, based on the fitness value and the number of parents to select
    parents = geneticXGboost.new_parents_selection(population=population, fitness=fitnessValue,
                                                   numParents=numberOfParentsMating)

    # mate these parents to create children having parameters from these parents (we are using uniform crossover)
    children = geneticXGboost.crossover_uniform(parents=parents,
                                                childrenSize=(populationSize[0] - parents.shape[0], numberOfParameters))

    # add mutation to create genetic diversity
    children_mutated = geneticXGboost.mutation(children, numberOfParameters)

    '''
    Create the new population, which will contain the parents that were previously
    selected based on the fitness score; the rest will be the children
    '''
    population[0:parents.shape[0], :] = parents           # fittest parents
    population[parents.shape[0]:, :] = children_mutated   # children

    # store the parameter values of this generation
    populationHistory[(generation + 1) * numberOfParents: (generation + 1) * numberOfParents + numberOfParents, :] = population
Finally, we obtain the best score and the associated parameters:
# Best solution from the final iteration
fitness = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix,
                                          dMatrixtest=xgbDMatrixTest, y_test=y_test)
fitnessHistory[generation + 1, :] = fitness

# index of the best solution
bestFitnessIndex = np.where(fitness == np.max(fitness))[0][0]

# Best fitness
print("Best fitness is =", fitness[bestFitnessIndex])

# Best parameters
print("Best parameters are:")
print('learning_rate', population[bestFitnessIndex][0])
print('n_estimators', population[bestFitnessIndex][1])
print('max_depth', int(population[bestFitnessIndex][2]))
print('min_child_weight', population[bestFitnessIndex][3])
print('gamma', population[bestFitnessIndex][4])
print('subsample', population[bestFitnessIndex][5])
print('colsample_bytree', population[bestFitnessIndex][6])
Now let us visualize the change in the population's fitness across generations (figure below). Although two of the parents in the randomly generated initial population already started with a high F1 score (~0.98), we were able to improve it further in the final generation. The lowest F1 score among the parents in the initial population was 0.9143, and the best score among the parents in the final generation was 0.9947. This shows that a simple implementation of a genetic algorithm can improve XGBoost's performance metric.
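A minimal sketch of how such a figure could be produced from the fitnessHistory array built above, assuming matplotlib is available (the styling here is illustrative, not the original plotting code):

import matplotlib.pyplot as plt

# fitnessHistory has shape (numberOfGenerations + 1, numberOfParents)
generations = range(fitnessHistory.shape[0])
plt.plot(generations, fitnessHistory.max(axis=1), label='Best F1 score')
plt.plot(generations, fitnessHistory.mean(axis=1), label='Mean F1 score')
plt.plot(generations, fitnessHistory.min(axis=1), label='Worst F1 score')
plt.xlabel('Generation')
plt.ylabel('F1 score')
plt.legend()
plt.show()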