Deep Neural Networks for Regression Problems
Neural networks are well known for classification problems; for example, they are used for handwritten digit classification. But the question is: will they be effective if we use them for regression problems?
In this article, I will use a deep neural network to predict house prices on a dataset from Kaggle.
You can download the dataset here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data
First, we import the required Python dependencies:
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)
Processing the Dataset
We won't go deep into processing the dataset; all we want is to get it ready to feed into our machine learning models.
We will drop every feature with missing values, then encode the categorical features, and that's it.
Loading the dataset:
- Load the train and test data into pandas DataFrames
- Combine the train and test data to process them together
The Python code is as follows:
def get_data():
    # get train data
    train_data_path = 'train.csv'
    train = pd.read_csv(train_data_path)

    # get test data
    test_data_path = 'test.csv'
    test = pd.read_csv(test_data_path)

    return train, test

def get_combined_data():
    # reading train data
    train, test = get_data()

    target = train.SalePrice
    train.drop(['SalePrice'], axis=1, inplace=True)

    # pd.concat replaces the deprecated DataFrame.append
    combined = pd.concat([train, test])
    combined.reset_index(inplace=True)
    combined.drop(['index', 'Id'], inplace=True, axis=1)
    return combined, target

# Load train and test data into pandas DataFrames
train_data, test_data = get_data()

# Combine train and test data to process them together
combined, target = get_combined_data()
combined.describe()
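Before dropping anything, it is worth checking how many missing values each column actually has. This is an optional, minimal sketch that can be run once combined is built above:

# Count missing values per column and show only the columns that have any
missing = combined.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))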
Let's define a function to get the columns that don't have any missing values. The Python code is as follows:
def get_cols_with_no_nans(df, col_type):
    '''
    Arguments :
    df : The dataframe to process
    col_type :
        num : to only get numerical columns with no nans
        no_num : to only get non-numerical columns with no nans
        all : to get any columns with no nans
    '''
    if (col_type == 'num'):
        predictors = df.select_dtypes(exclude=['object'])
    elif (col_type == 'no_num'):
        predictors = df.select_dtypes(include=['object'])
    elif (col_type == 'all'):
        predictors = df
    else:
        print('Error : choose a type (num, no_num, all)')
        return 0
    cols_with_no_nans = []
    for col in predictors.columns:
        if not df[col].isnull().any():
            cols_with_no_nans.append(col)
    return cols_with_no_nans
Get the columns without any missing values:
num_cols = get_cols_with_no_nans(combined, 'num')
cat_cols = get_cols_with_no_nans(combined, 'no_num')
Let's see how many columns we got. The Python code is as follows:
print('Number of numerical columns with no nan values :', len(num_cols))
print('Number of non-numerical columns with no nan values :', len(cat_cols))
[out]:
Number of numerical columns with no nan values : 25
Number of non-numerical columns with no nan values : 20
combined = combined[num_cols + cat_cols]
combined.hist(figsize=(12, 10))
plt.show()
Histograms of the features
Correlation Between the Features
train_data = train_data[num_cols + cat_cols]
train_data['Target'] = target

C_mat = train_data.corr()
fig = plt.figure(figsize=(15, 15))
sb.heatmap(C_mat, vmax=.8, square=True)
plt.show()
From the correlation heatmap above, we can see that about 15 features are highly correlated with the target.
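As a quick numerical check, we can also rank the features by their absolute correlation with the target. This is a minimal sketch reusing the C_mat matrix computed above; the 0.5 cutoff is an arbitrary choice for "highly correlated", not a value from the analysis:

# Rank the numerical features by absolute correlation with the target
corr_with_target = C_mat['Target'].drop('Target').abs().sort_values(ascending=False)
print(corr_with_target[corr_with_target > 0.5])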
One-Hot Encoding the Categorical Features
We will encode the categorical features using one-hot encoding. The Python code is as follows:
def oneHotEncode(df, colNames):
    for col in colNames:
        if (df[col].dtype == np.dtype('object')):
            dummies = pd.get_dummies(df[col], prefix=col)
            df = pd.concat([df, dummies], axis=1)

            # drop the encoded column
            df.drop([col], axis=1, inplace=True)
    return df

print('There were {} columns before encoding categorical features'.format(combined.shape[1]))
combined = oneHotEncode(combined, cat_cols)
print('There are {} columns after encoding categorical features'.format(combined.shape[1]))
[out]:
There were 45 columns before encoding categorical features
There are 149 columns after encoding categorical features
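As a side note, pandas can do the same transformation in a single call. The sketch below is a hypothetical alternative to the oneHotEncode loop; it would replace the call above rather than follow it, since the categorical columns no longer exist once they have been encoded:

# One-call alternative to oneHotEncode: encodes the listed columns
# and drops the original categorical columns automatically
combined = pd.get_dummies(combined, columns=cat_cols)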
Now, split the combined DataFrame back into train and test data:
def split_combined():
    global combined
    train = combined[:1460]
    test = combined[1460:]

    return train, test

train, test = split_combined()
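The hard-coded 1460 is simply the number of rows in the Kaggle training set. An alternative sketch derives the split point from the data instead (train_data is the training DataFrame loaded earlier):

# Derive the split point from the original training set size
n_train = len(train_data)  # 1460 for this dataset
train = combined[:n_train]
test = combined[n_train:]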
Building the Deep Neural Network
- Define a sequential model
- Add some dense layers
- Use 'relu' as the activation function for the hidden layers
- Use the 'normal' initializer as the kernel_initializer
- We will use mean_absolute_error as the loss function
- Define the output layer with only one node
- Use 'linear' as the activation function for the output layer
NN_model = Sequential()

# The Input Layer :
NN_model.add(Dense(128, kernel_initializer='normal', input_dim=train.shape[1], activation='relu'))

# The Hidden Layers :
NN_model.add(Dense(256, kernel_initializer='normal', activation='relu'))
NN_model.add(Dense(256, kernel_initializer='normal', activation='relu'))
NN_model.add(Dense(256, kernel_initializer='normal', activation='relu'))

# The Output Layer :
NN_model.add(Dense(1, kernel_initializer='normal', activation='linear'))

# Compile the network :
NN_model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_absolute_error'])
NN_model.summary()
Define a checkpoint callback. The Python code is as follows:
checkpoint_name = 'Weights-{epoch:03d}--{val_loss:.5f}.hdf5'
checkpoint = ModelCheckpoint(checkpoint_name, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
callbacks_list = [checkpoint]
Train the model:
NN_model.fit(train, target, epochs=500, batch_size=32, validation_split=0.2, callbacks=callbacks_list)
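It can also be useful to inspect the loss curves. fit returns a Keras History object, so a minimal sketch that assigns its return value and plots training versus validation error looks like this:

# Assign the return value of fit to get the per-epoch loss history
history = NN_model.fit(train, target, epochs=500, batch_size=32,
                       validation_split=0.2, callbacks=callbacks_list)

plt.plot(history.history['loss'], label='train MAE')
plt.plot(history.history['val_loss'], label='validation MAE')
plt.xlabel('epoch')
plt.legend()
plt.show()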
# Load weights file of the best model :
weights_file = 'Weights-478--18738.19831.hdf5'  # choose the best checkpoint
NN_model.load_weights(weights_file)  # load it
NN_model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_absolute_error'])
We can see that the validation loss of the best model is 18738.19.
Not bad at all. With some more preprocessing and more training, we could do better.
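Since this is a Kaggle competition, we can also generate a submission from the trained network. This is a minimal sketch, assuming the competition's submission format of an Id column plus the predicted SalePrice; test is the processed test split from earlier, and test_data still holds the original Id column:

# Predict on the processed test set and write a submission file
predictions = NN_model.predict(test)
submission = pd.DataFrame({'Id': test_data['Id'],
                           'SalePrice': predictions.flatten()})
submission.to_csv('submission.csv', index=False)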
Trying other machine learning (ML) algorithms:
Now, let's try other machine learning algorithms and compare the results.
We will use a random forest regressor and XGBRegressor.
Split the data into training and validation sets:
train_X, val_X, train_y, val_y = train_test_split(train, target, test_size=0.25, random_state=14)
We will try the random forest model first:
model = RandomForestRegressor()
model.fit(train_X, train_y)

# Get the mean absolute error on the validation data
predicted_prices = model.predict(val_X)
MAE = mean_absolute_error(val_y, predicted_prices)
print('Random forest validation MAE = ', MAE)
Random forest validation MAE = 19089.71589041096
Now, let's try the XGBoost model:
XGBModel = XGBRegressor()
XGBModel.fit(train_X, train_y, verbose=False)

# Get the mean absolute error on the validation data :
XGBpredictions = XGBModel.predict(val_X)
MAE = mean_absolute_error(val_y, XGBpredictions)
print('XGBoost validation MAE = ', MAE)
[out]:
XGBoost validation MAE = 17869.75410958904
I really didn't think the neural network would beat the random forest and XGBoost algorithms, but let's try not to be too optimistic: remember that we didn't tune any hyperparameters for the random forest and XGBoost models, and I believe that if we did, these two models would surpass the neural network.
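For instance, a minimal hyperparameter search for the random forest could look like the sketch below; the grid values are arbitrary illustrations, not tuned choices:

from sklearn.model_selection import GridSearchCV

# Small, illustrative grid; a real search would cover more values
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10, 20]}
grid = GridSearchCV(RandomForestRegressor(random_state=14), param_grid,
                    scoring='neg_mean_absolute_error', cv=3)
grid.fit(train_X, train_y)
print('Best params :', grid.best_params_)
print('Best CV MAE :', -grid.best_score_)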
Summary:
- We loaded and processed the dataset
- We got familiar with the dataset by plotting some histograms and a correlation heatmap of the features
- We used a deep neural network with three hidden layers, each with 256 nodes
- We used a linear activation function on the output layer
- We also tested two other models
- Our deep neural network was able to outperform these two models
- These two models could beat the deep neural network if we tuned their hyperparameters