Machine Learning: Multiple Regression Example in Python
We use multiple regression when more than one input feature determines the outcome:
Y = w0 + (x1*w1 + x2*w2 + x3*w3 + .... + xn*wn)
where x1, x2, x3, .... xn are the input features, w1 .... wn are their weights, and w0 is the intercept (bias) term.
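As a quick sanity check, the equation above can be evaluated directly with NumPy (the intercept, weights, and feature values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical intercept, weights, and input features, just to illustrate the equation
w0 = 1.0                        # intercept (bias) term
w = np.array([0.5, -2.0, 3.0])  # weights w1..w3
x = np.array([4.0, 1.0, 2.0])   # input features x1..x3

# Y = w0 + (x1*w1 + x2*w2 + x3*w3)
Y = w0 + np.dot(x, w)
print(Y)  # 7.0
```

This is exactly what LinearRegression computes at prediction time once it has learned w0 and the weights.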
In the real world, data rarely fits into the two dimensions that simple linear regression handles.
Let's look at an example using the HousePrice.csv dataset (https://github.com/neelindresh/NeelBlog/blob/master/HousePrice.csv). Besides the Date column, it contains 8 numeric columns, and we will use all 8 of them (remember, the Date column is not a feature, it is just information, so we can ignore it). That leaves us with 7 columns as the feature set X and 1 column as the target Y. The Python code is as follows:
import pandas
import numpy as np
#load csv file
df=pandas.read_csv('./DataSet/HousePrice.csv')
print(df.describe())
df=df.drop(['Date'],axis=1)
X=df[list(df.columns)[:-1]]
Y=df[list(df.columns)[-1]]
#print(X)
#print(Y)
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(X,Y,random_state=0)
from sklearn.linear_model import LinearRegression
reg=LinearRegression()
reg.fit(xtrain,ytrain)
print(list(X.columns))
print(list(reg.predict(xtest))[:5])
print('Accuracy Score:', reg.score(xtest,ytest))
Import the CSV file using the Pandas library:
df=pandas.read_csv('./DataSet/HousePrice.csv')
print(df.describe())
pandas.read_csv() reads a CSV file from the given location.
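A minimal sketch of what read_csv does, using an in-memory CSV instead of a file on disk (the column names here are assumptions for illustration, not the full HousePrice.csv header):

```python
import io
import pandas

# Tiny in-memory CSV standing in for HousePrice.csv
csv_text = "Date,Price (All),Change (All)\n2020-01,100.0,1.2\n2020-02,101.5,1.5\n"
df = pandas.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 3): two rows, three columns
print(list(df.columns))  # ['Date', 'Price (All)', 'Change (All)']
```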
Remove the data we do not need (in this table, that is the ['Date'] column):
df=df.drop(['Date'],axis=1)
This is the dataset after dropping 'Date':
Remaining columns: Price (All), Change (All), Price (New), Change (New), Price (Modern), Change (Modern), Price (Older), Change (Older)
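The drop step can be sketched on a toy frame; note that drop returns a new DataFrame rather than modifying df in place, which is why the result is assigned back:

```python
import pandas

# Toy frame with a Date column plus one numeric column
df = pandas.DataFrame({
    'Date': ['2020-01', '2020-02'],
    'Price (All)': [100.0, 101.5],
})
df = df.drop(['Date'], axis=1)  # axis=1 means drop a column, not a row
print(list(df.columns))         # ['Price (All)']
```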
Define X (the feature set) and Y (the target):
X=df[list(df.columns)[:-1]]
Y=df[list(df.columns)[-1]]
So Y, our target, becomes 'Change (Older)', and the remaining columns become our feature set X:
X = ['Price (All)', 'Change (All)', 'Price (New)', 'Change (New)', 'Price (Modern)', 'Change (Modern)', 'Price (Older)']
Y = ['Change (Older)']
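The column slicing works the same way on any DataFrame (the column names below are toy placeholders, not the real ones):

```python
import pandas

# Toy frame: two feature columns plus a target column in the last position
df = pandas.DataFrame({'f1': [1, 2], 'f2': [3, 4], 'target': [5, 6]})

X = df[list(df.columns)[:-1]]  # every column except the last -> DataFrame
Y = df[list(df.columns)[-1]]   # the last column -> Series

print(list(X.columns))  # ['f1', 'f2']
print(Y.name)           # 'target'
```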
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(X,Y,random_state=0)
from sklearn.linear_model import LinearRegression
reg=LinearRegression()
reg.fit(xtrain,ytrain)
train_test_split splits the dataset into 75% training data and 25% test data (the default ratio):
xtrain,xtest,ytrain,ytest = train_test_split(X,Y,random_state=0)
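The 75/25 split can be verified on toy data; when test_size is not given, train_test_split defaults it to 0.25:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 samples with 2 features each
Y = np.arange(20)

# With the default test_size=0.25, 20 samples split into 15 train / 5 test
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, random_state=0)
print(len(xtrain), len(xtest))  # 15 5
```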
LinearRegression().fit(X,Y) fits the model to the given X and Y values:
reg = LinearRegression().fit(xtrain,ytrain)
Measuring accuracy:
reg.predict(xtest)
reg.score(xtest,ytest)
The .score(X of the test data, Y of the test data) method returns the R² score (the coefficient of determination), which measures how closely the predictions match the actual values; 1.0 is a perfect fit.
The .predict(X of the test data) method returns a list of predicted values, one for each row in xtest.
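For regression, .score is R² rather than a classification accuracy; the sketch below (on synthetic data, not HousePrice.csv) shows it matches sklearn's r2_score on the same predictions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with an exact linear relationship
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.5

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)

# .score(X, y) computes R^2 between the true y and reg.predict(X)
print(np.isclose(reg.score(X, y), r2_score(y, pred)))  # True
```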
OUTPUT:
[‘Price (All)’, ‘Change (All)’, ‘Price (New)’, ‘Change (New)’, ‘Price (Modern)’, ‘Change (Modern)’, ‘Price (Older)’]
[32.811615948902435, 9.949265286326835, 31.495573493573865, 9.55246290821825, 2.734303573555142]
Accuracy Score: 0.9900852129223513
As you can see, the R² score is about 0.99, an almost perfect fit on this dataset.
I will show another example of multiple regression using the Boston housing dataset.
#Code for Boston DataSet
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
#Load the Boston dataset from the Scikit-Learn datasets
#(note: load_boston was removed in scikit-learn 1.2, so this needs an older version)
boston = load_boston()
#Define X and Y and explore the dataset
X = boston.data
print(type(X))
print(boston.keys())
print('Feature names:',boston['feature_names'])
Y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, shuffle=True)
lineReg = LinearRegression()
lineReg.fit(X_train, y_train)
print(lineReg.score(X_test, y_test ))
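Since load_boston was removed in scikit-learn 1.2, the same workflow can be reproduced on a synthetic regression dataset; make_regression below is my stand-in for the Boston data, not part of the original example:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data: 506 samples, 13 features
X, Y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=0)
lineReg = LinearRegression().fit(X_train, y_train)
score = lineReg.score(X_test, y_test)
print(score)  # R^2 on the held-out test set
```

The fit/score steps are identical to the Boston version; only the data source changes.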