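Kaggle Getting-Started Competition: House Price Prediction (Data Analysis Part)

This project comes from a classic Kaggle competition: house price prediction. The write-up is split into two parts, data analysis and data mining; this is the data analysis part.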


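Understanding the competition

Competition overview

Many factors influence house prices. The dataset for this competition has 79 variables that describe almost every aspect of residential homes in Ames, Iowa, and the task is to predict each home's final sale price.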

Tech stack

  • Creative feature engineering
  • Advanced regression techniques (such as random forest and
    gradient boosting)

Goal

Predict the sale price of each house: for every Id in the test set, give the corresponding value of the SalePrice variable.

Submission format

Id,SalePrice
1461,169000.1
1462,187724.1233
1463,175221
etc.
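
As a rough sketch of how a file in this format could be produced with pandas: the helper name make_submission and the sample values below are illustrative assumptions, and the predictions are assumed to already be in dollars.

```python
import io

import pandas as pd


def make_submission(ids, predicted_prices):
    """Build a two-column table in the Id,SalePrice layout shown above."""
    return pd.DataFrame({"Id": ids, "SalePrice": predicted_prices})


# Hypothetical Ids and predictions, for illustration only.
submission = make_submission([1461, 1462, 1463],
                             [169000.1, 187724.1233, 175221.0])

# A StringIO buffer stands in for a real file such as 'submission.csv'.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps only Id and SalePrice
print(buffer.getvalue())
```

Passing a file path instead of the buffer to to_csv would write the file to disk.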

Data analysis

Data description

First we load the data and take a look:

import numpy as np
import pandas as pd

train_df = pd.read_csv('./input/train.csv', index_col=0)
test_df = pd.read_csv('./input/test.csv', index_col=0)
train_df.head()

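We can see there are 80 columns: 79 features plus the target, SalePrice.

Next we merge the training and test sets. This makes preprocessing easier, because the features of both sets are transformed into the same format; once preprocessing is done, we split them apart again.

SalePrice is our training target, so it appears only in the training set and not in the test set. We therefore need to take this column out before merging. Before doing so, let's look at its distribution.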

prices = pd.DataFrame({'price': train_df['SalePrice'], 'log(price+1)': np.log1p(train_df['SalePrice'])})
prices.hist()

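The label itself is not smooth (it is right-skewed). To help the regression model learn more accurately, we first smooth the label, that is, bring it closer to a normal distribution. Here I use log1p, which is log(x+1). Note that because this step transforms the data, the predicted values must be transformed back when we compute the final result. The inverse of log1p() is expm1(); the details come later, when we use it.

Then we take this column out: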
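
To make the effect of the transform concrete, here is a small sketch on synthetic data. The log-normal sample is an assumption standing in for SalePrice, not the competition data. It compares skewness before and after log1p and checks that expm1 recovers the original values.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); not the competition data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

log_prices = np.log1p(prices)    # transform applied to the training target
restored = np.expm1(log_prices)  # inverse transform applied to predictions

print("skew before:", prices.skew())
print("skew after: ", log_prices.skew())
print("round trip ok:", np.allclose(restored, prices))
```

The skewness of the transformed values should be much closer to zero, and the round trip shows why predictions made on the log scale have to go through expm1 before submission.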

y_train = np.log1p(train_df.pop('SalePrice'))

y_train.head()

Id
1    12.247699
2    12.109016
3    12.317171
4    11.849405
5    12.429220
Name: SalePrice, dtype: float64

y_train now holds the log1p-transformed SalePrice column, and train_df no longer contains it.

Then we merge the two datasets:

df = pd.concat((train_df, test_df), axis=0)

Check the shape:

df.shape

(2919, 79)

df is the merged DataFrame.
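
For the later step of separating the two sets again, one possible approach is to select rows by the original indices. The sketch below uses toy frames in place of the real train_df and test_df, and the names train_processed and test_processed are made up for illustration. It assumes the Ids are unique across both sets.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df, indexed by Id like the real data.
train_df = pd.DataFrame({"LotArea": [8450, 9600]},
                        index=pd.Index([1, 2], name="Id"))
test_df = pd.DataFrame({"LotArea": [11622, 14267, 13830]},
                       index=pd.Index([1461, 1462, 1463], name="Id"))

df = pd.concat((train_df, test_df), axis=0)

# ... shared preprocessing on df would go here ...

# Split back using the original indices (assumes Ids do not overlap).
train_processed = df.loc[train_df.index]
test_processed = df.loc[test_df.index]

print(train_processed.shape, test_processed.shape)
```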


Data preprocessing

According to the description provided by Kaggle, the fields are as follows:

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale

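Next we analyze the features. The list above contains one target variable, SalePrice, and 79 features. That is a lot, and this analysis step prepares for the feature engineering that follows.

Let's see which features have missing values: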

print(pd.isnull(df).sum())

This output is hard to read, so let's first look at the 10 features with the most missing values:

df.isnull().sum().sort_values(ascending=False).head(10)

To make this clearer, we look at the missing rate (in percent) of each feature:

df_na = (df.isnull().sum() / len(df)) * 100
df_na = df_na.drop(df_na[df_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing rate (%)': df_na})
missing_data.head(10)

Visualizing it:

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)  # rotate the feature names; rotation expects degrees as a number
sns.barplot(x=df_na.index, y=df_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

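We can see that features such as PoolQC, MiscFeature, Alley, Fence and FireplaceQu have a large share of missing values, LotFrontage has a missing rate of 16.7%, and GarageType, GarageFinish, GarageQual and GarageCond have similar missing rates. Some of these features are categorical and some are numerical; how to handle their missing values is covered in the feature engineering part.

Finally, we run a correlation analysis over the features and view it as a heatmap: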
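
Since categorical and numerical features usually need different treatment, it can help to split the columns with missing values by dtype first. The sketch below does this on a toy frame: the column names are borrowed from the dataset, but the values are made up. It does not decide how to fill anything.

```python
import numpy as np
import pandas as pd

# Toy frame using a few real column names; the values are invented.
df = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Columns that contain at least one missing value.
missing_cols = df.columns[df.isnull().any()]

# Non-numeric columns are treated as categorical, numeric ones as numerical.
categorical_missing = df[missing_cols].select_dtypes(exclude="number").columns.tolist()
numerical_missing = df[missing_cols].select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_missing)
print("numerical:  ", numerical_missing)
```

The dtype is only a heuristic. A column such as MSSubClass stores numeric codes for a building class, so it would land in the numerical group even though it is categorical in meaning.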

# numeric_only=True (pandas >= 1.5) restricts the correlation to numeric columns
corrmat = train_df.corr(numeric_only=True)
plt.subplots(figsize=(15,12))
sns.heatmap(corrmat, vmax=0.9, square=True)

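We can see that some features are highly correlated with each other, which tends to cause overfitting, so they need to be removed. In the next article, the data mining part, we will process these features and train the model.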
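
A heatmap shows the pattern, but a list of the strongly correlated pairs is easier to act on. The following is a minimal sketch on synthetic features. The 0.8 threshold is an assumed cut-off, not a value from this article, and the column names a, b, c are placeholders.

```python
import numpy as np
import pandas as pd

# Synthetic features: 'b' is nearly a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corrmat = features.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (self-correlation) is dropped.
mask = np.triu(np.ones(corrmat.shape, dtype=bool), k=1)
upper = corrmat.where(mask)

threshold = 0.8  # assumed cut-off, for illustration only
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

On the real data the same steps could be applied to the corrmat computed above; which feature of each pair to drop is a separate decision.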


If you spot any mistakes or shortcomings, corrections are welcome.
