Kaggle Beginner Competition: House Price Prediction (Part 1: Data Analysis)
This project comes from the classic Kaggle house price prediction competition. The write-up is split into two parts, data analysis and data mining; this part covers data analysis.
Understanding the Competition
Competition Overview
Many factors influence house prices. The dataset for this competition contains 79 variables describing almost every aspect of residential homes in Ames, Iowa, and the task is to predict each home's final sale price.
Tech Stack
- Creative feature engineering
- Advanced regression techniques (e.g. random forest and gradient boosting)
Goal
Predict the price of every house: for each Id in the test set, submit the corresponding value of SalePrice.
Submission Format
Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
Data Analysis
Data Description
First, import the libraries, load the data, and take a look:
import numpy as np
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns

train_df = pd.read_csv('./input/train.csv', index_col=0)
test_df = pd.read_csv('./input/test.csv', index_col=0)
train_df.head()
We can see there are 80 columns: 79 features plus the SalePrice target.
Next we will concatenate the training and test sets. This makes preprocessing more convenient: the same feature transformations are applied to both sets in one pass, and once preprocessing is done we split them apart again.
SalePrice is our training target; it appears only in the training set and not in the test set, so we need to take this column out before merging. Before removing it, let's first look at its distribution.
prices = DataFrame({'price': train_df['SalePrice'],
                    'log(price+1)': np.log1p(train_df['SalePrice'])})
prices.hist()
Because the label itself is skewed rather than smooth, we first smooth (normalize) it so the regression model can learn more accurately. Here I use log1p, i.e. log(x+1). Note that since we smooth the target here, we must transform the predictions back at the end; the inverse of log1p() is expm1(), which we will use later.
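The log1p/expm1 round trip can be verified directly. A minimal check, using a few hypothetical prices:

```python
import numpy as np

# A few made-up sale prices (illustrative values only)
prices = np.array([169000.0, 187724.0, 175221.0])

smoothed = np.log1p(prices)    # log(x + 1), compresses the skewed scale
restored = np.expm1(smoothed)  # exp(x) - 1, the exact inverse

# restored matches prices up to floating-point precision
```

The same expm1 call is what turns the model's log-scale predictions back into dollar amounts before submission.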
Now pop the column out:
y_train = np.log1p(train_df.pop('SalePrice'))
y_train.head()
which gives:
Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64
Now y_train holds the (log-transformed) SalePrice column.
Then concatenate the two datasets:
df = pd.concat((train_df, test_df), axis=0)
Check the shape:
df.shape
(2919, 79)
df is the merged DataFrame.
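As mentioned earlier, once the shared preprocessing is done the combined frame gets split back apart. Since the two sets keep their original Id indices, the split is just index selection. A sketch with toy data (the values and the two-feature frames are hypothetical stand-ins for train_df/test_df):

```python
import pandas as pd

# Toy stand-ins for the real train/test frames
toy_train = pd.DataFrame({'LotArea': [8450, 9600], 'OverallQual': [7, 6]},
                         index=pd.Index([1, 2], name='Id'))
toy_test = pd.DataFrame({'LotArea': [11622, 14267], 'OverallQual': [5, 6]},
                        index=pd.Index([1461, 1462], name='Id'))

# Merge, exactly as done for the real data
combined = pd.concat((toy_train, toy_test), axis=0)

# ... shared preprocessing would happen here on `combined` ...

# Split back apart using the saved indices
back_train = combined.loc[toy_train.index]
back_test = combined.loc[toy_test.index]
```

Because the Kaggle train and test Ids do not overlap, `.loc` on the saved index recovers each half exactly.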
Data Preprocessing
According to the description Kaggle provides, the features are:
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
Next we analyze the features. The list above covers the target variable SalePrice plus 79 features, which is a lot; this analysis step prepares the ground for the feature engineering later.
Let's check which features have missing values:
print(pd.isnull(df).sum())
That output is hard to scan, so let's first look at the 10 features with the most missing values:
df.isnull().sum().sort_values(ascending=False).head(10)
To show this more clearly, we examine the missing rate instead:
df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': df_na})
missing_data.head(10)
Visualizing it:
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
We can see that PoolQC, MiscFeature, Alley, Fence, and FireplaceQu have large numbers of missing values; LotFrontage has a missing rate of about 16.7%; and GarageType, GarageFinish, GarageQual, and GarageCond have similar missing rates. Some of these features are categorical and some are numerical. How to handle their missing values will be covered in the feature engineering part.
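As a small preview of what that handling might look like (a sketch on toy data, not the final recipe): for many of these categorical columns, NaN actually means the feature is absent (e.g. no pool), so a sentinel category works, while a numerical column such as LotFrontage can take the median:

```python
import pandas as pd

# Toy frame with hypothetical values for two of the columns above
toy = pd.DataFrame({
    'PoolQC': [None, 'Gd', None],        # categorical: NaN means "no pool"
    'LotFrontage': [65.0, None, 80.0],   # numerical: impute with the median
})

toy['PoolQC'] = toy['PoolQC'].fillna('None')
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].median())
```

The key point is that the right imputation depends on what a missing value means for each feature, not on its dtype alone.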
Finally, we run a correlation analysis across the features and view the heatmap:
corrmat = train_df.corr()
plt.subplots(figsize=(15, 12))
sns.heatmap(corrmat, vmax=0.9, square=True)
We can see that some features are strongly correlated with one another, which can easily lead to overfitting, so some of them need to be dropped. In the next article, on data mining, we will process these features and train the model.
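One way to turn the heatmap reading into a concrete list is to extract the upper triangle of the correlation matrix and keep pairs above a threshold. A sketch on hypothetical data (the frame below stands in for train_df's numeric columns; the 0.8 cutoff is an assumption, not the article's chosen value):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns; GarageArea is built to be nearly collinear
# with GarageCars, mimicking the kind of pair the heatmap reveals
rng = np.random.default_rng(0)
base = rng.normal(size=200)
frame = pd.DataFrame({
    'GarageCars': base,
    'GarageArea': base * 50 + rng.normal(scale=0.1, size=200),
    'YrSold': rng.normal(size=200),
})

corrmat = frame.corr().abs()
# Keep only the upper triangle so each pair is counted once
mask = np.triu(np.ones(corrmat.shape, dtype=bool), k=1)
pairs = corrmat.where(mask).stack()
high_pairs = pairs[pairs > 0.8]  # candidate pairs for pruning
```

From each highly correlated pair, one feature can then be dropped or the two combined, which is the pruning discussed for the next part.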
Corrections and suggestions are welcome.