Machine Learning: A Multiple Regression Example in Python


When more than one input feature enters the equation, we use multiple regression:

Y = w0 + (x1*w1 + x2*w2 + x3*w3 + .... + xn*wn)

where x1, x2, x3, ..., xn are the input features, w1, ..., wn are their weights, and w0 is the intercept.
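For instance, with three features the prediction is just a weighted sum of the inputs plus the intercept. A minimal sketch with made-up weights and inputs:

```python
import numpy as np

w0 = 1.0                         # intercept (made-up value)
w = np.array([2.0, 0.5, -1.0])   # weights w1..w3 (made-up values)
x = np.array([3.0, 4.0, 2.0])    # input features x1..x3

# y = w0 + x1*w1 + x2*w2 + x3*w3
y = w0 + np.dot(x, w)
print(y)  # 1 + 6 + 2 - 2 = 7.0
```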

In the real world, data is rarely two-dimensional, so simple linear regression on a single feature is often not enough.

Let's look at an example: the HousePrice.csv dataset (https://github.com/neelindresh/NeelBlog/blob/master/HousePrice.csv), which contains 8 columns of data plus a Date column. The Date column is not a feature, only information, so we can ignore it. That leaves us with 7 columns as the feature set X and 1 column as the target value Y. The Python code is as follows:

import pandas

import numpy as np

#load csv file

df=pandas.read_csv('./DataSet/HousePrice.csv')

print(df.describe())

df=df.drop(['Date'],axis=1)

X=df[list(df.columns)[:-1]]

Y=df[list(df.columns)[-1]]

#print(X)

#print(Y)

from sklearn.model_selection import train_test_split

xtrain,xtest,ytrain,ytest=train_test_split(X,Y,random_state=0)

from sklearn.linear_model import LinearRegression

reg=LinearRegression()

reg.fit(xtrain,ytrain)

print(list(reg.predict(xtrain))[:5])

print(reg.score(xtest,ytest))

Import the CSV file using the pandas library:

df=pandas.read_csv('./DataSet/HousePrice.csv')

print(df.describe())

pandas.read_csv() reads a CSV file from the given location.

Drop the data we don't need (in this table, the ['Date'] column):

df=df.drop(['Date'],axis=1)

These are the dataset's columns after removing 'Date':

Price (All) | Change (All) | Price (New) | Change (New) | Price (Modern) | Change (Modern) | Price (Older) | Change (Older)
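Note that df.drop returns a new DataFrame rather than modifying df in place, which is why the script reassigns the result back to df. A minimal sketch with a made-up two-column frame:

```python
import pandas as pd

# Made-up miniature frame standing in for the real dataset
df = pd.DataFrame({'Date': ['2020-01'], 'Price (All)': [100.0]})

df2 = df.drop(['Date'], axis=1)   # drop returns a new DataFrame
print('Date' in df.columns, 'Date' in df2.columns)  # True False
```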

Define X (the feature set) and Y (the target):

X=df[list(df.columns)[:-1]]

Y=df[list(df.columns)[-1]]

So Y, our target feature, becomes 'Change (Older)', and the remaining columns become our feature set X.

X=[‘Price (All)’, ‘Change (All)’, ‘Price (New)’, ‘Change (New)’, ‘Price (Modern)’, ‘Change (Modern)’, ‘Price (Older)’]

Y=[‘Change (Older)’]
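The column-slicing trick above can be shown on a tiny made-up frame with the same layout (fewer columns for brevity):

```python
import pandas as pd

# Made-up frame mimicking the layout after dropping 'Date'
df = pd.DataFrame({'Price (All)': [100.0],
                   'Change (All)': [1.2],
                   'Change (Older)': [0.8]})

X = df[list(df.columns)[:-1]]   # every column except the last
Y = df[list(df.columns)[-1]]    # the last column only
print(list(X.columns), Y.name)  # ['Price (All)', 'Change (All)'] Change (Older)
```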

from sklearn.model_selection import train_test_split

xtrain,xtest,ytrain,ytest=train_test_split(X,Y,random_state=0)

from sklearn.linear_model import LinearRegression

reg=LinearRegression()

reg.fit(xtrain,ytrain)

train_test_split divides the dataset into 75% training data and 25% test data (the defaults):

xtrain,xtest,ytrain,ytest = train_test_split(X,Y,random_state=0)
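To confirm the default split sizes, a quick sketch with 100 made-up samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples with 2 features each, plus 100 targets
X = np.arange(200).reshape(100, 2)
Y = np.arange(100)

xtrain, xtest, ytrain, ytest = train_test_split(X, Y, random_state=0)
print(len(xtrain), len(xtest))  # 75 25 -- default test_size is 0.25
```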

LinearRegression().fit(X,Y) -> fits a linear model to the given x and y values:

reg=LinearRegression().fit(xtrain,ytrain)

Measuring accuracy:

reg.predict(xtest)

reg.score(xtest,ytest)

The .score(x values of the test data, y values of the test data) method returns the R² score: a measure of how closely the predicted values match the actual values.

.predict(x values of the test data) returns a list of predicted values, one for each item in xtest.
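In fact, .score is equivalent to calling .predict and computing the R² score on the result. A small sketch with synthetic data (all values here are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Tiny synthetic dataset: y depends linearly on two features plus noise
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(50)

reg = LinearRegression().fit(X, y)
preds = reg.predict(X)  # one prediction per row of X

# .score(X, y) and r2_score(y, predictions) agree
print(np.allclose(reg.score(X, y), r2_score(y, preds)))  # True
```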

OUTPUT:

[‘Price (All)’, ‘Change (All)’, ‘Price (New)’, ‘Change (New)’, ‘Price (Modern)’, ‘Change (Modern)’, ‘Price (Older)’]

[32.811615948902435, 9.949265286326835, 31.495573493573865, 9.55246290821825, 2.734303573555142]

Accuracy Score: 0.9900852129223513

As you can see, the R² score is about 0.99, i.e. 99%.

I will show another multiple regression example using the Boston housing dataset.

#Code for Boston DataSet

#Note: load_boston was removed in scikit-learn 1.2, so this snippet requires an older version

from sklearn.datasets import load_boston

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

#Load the Boston dataset from the scikit-learn datasets:

boston = load_boston()

#Define X and Y and explore the dataset

X = boston.data

print(type(X))

print(boston.keys())

print('Feature names:',boston['feature_names'])

Y = boston.target

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, shuffle=True)

lineReg = LinearRegression()

lineReg.fit(X_train, y_train)

print(lineReg.score(X_test, y_test))
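Since load_boston was removed in scikit-learn 1.2, the same workflow can be reproduced on current versions with a dataset that still ships with scikit-learn, for example load_diabetes (used here purely as a stand-in):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# load_diabetes is bundled with scikit-learn, so no download is needed
diabetes = load_diabetes()
X = diabetes.data
Y = diabetes.target
print('Feature names:', diabetes['feature_names'])

# Same split ratio as the Boston example; random_state fixed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.33, random_state=0)

lineReg = LinearRegression()
lineReg.fit(X_train, y_train)
print(lineReg.score(X_test, y_test))
```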
