Polynomial Regression with Python

pandazjd

2018-10-09

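Linear regression in machine learning requires the relationship between the dependent variable and the independent variables to be linear. What if the data follows a more complex distribution, as in the figure below? Can a linear model be used to fit nonlinear data? How do we generate the curve that best captures the data? We will answer these questions in this post.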

[Figure: scatter plot of data with a nonlinear pattern]

Why Polynomial Regression?

To understand the need for polynomial regression, let's first generate a random dataset.

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)
plt.scatter(x,y, s=10)
plt.show()

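The generated data looks like this:

[Figure: scatter plot of the generated data]

Let's apply a linear regression model to this dataset. The Python code is as follows: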

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)
# transforming the data to include another axis
x = x[:, np.newaxis]
y = y[:, np.newaxis]
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)
plt.scatter(x, y, s=10)
plt.plot(x, y_pred, color='r')
plt.show()

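The plot of the best-fit line is:

[Figure: the generated data with the best-fit straight line]

We can see that the straight line fails to capture the pattern in the data. This is an example of underfitting. Computing the RMSE and R² score of the linear fit gives: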

RMSE of linear regression is 15.908242501429998.

R2 score of linear regression is 0.6386750054827146
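The plotting script above does not itself compute these metrics. A minimal sketch of how they could be obtained, assuming the same seed and data-generating code and using scikit-learn's `mean_squared_error` and `r2_score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Same synthetic data as in the scripts above
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)
x = x[:, np.newaxis]
y = y[:, np.newaxis]

model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y, y_pred))
r2 = r2_score(y, y_pred)
print("RMSE of linear regression is {}".format(rmse))
print("R2 score of linear regression is {}".format(r2))
```

Both metrics here are measured on the training data, so they describe how well the line fits the points it was trained on, not how well it generalizes.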

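To overcome underfitting, we need to increase the complexity of the model.

To generate a higher-order equation, we can add powers of the original features as new features. The linear model,

Y = θ₀ + θ₁x

can be transformed into

Y = θ₀ + θ₁x + θ₂x²

This is still considered a linear model, because the coefficients/weights associated with the features still enter linearly. x² is just another feature. The curve we are fitting, however, is quadratic in nature.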
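To make the "linear in the coefficients" point concrete, here is a small NumPy-only sketch. It builds the columns 1, x and x² by hand and solves an ordinary least-squares problem; the variable names (`X`, `theta`, `coeffs`) are illustrative and not from the original article.

```python
import numpy as np

np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)

# Design matrix with columns [1, x, x^2]; x^2 is treated as just another feature
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Ordinary least squares: the problem is linear in theta
theta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# np.polyfit solves the same problem; it returns the highest power first
coeffs = np.polyfit(x, y, deg=2)
print(theta)         # [theta0, theta1, theta2]
print(coeffs[::-1])  # same coefficients, reordered to match
```

The two results agree because fitting a quadratic is the same linear least-squares problem, only with an extra column in the design matrix.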

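To convert the original features into higher-order terms, we will use the PolynomialFeatures class provided by scikit-learn. We then train the model using linear regression. The Python code is as follows: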

import operator
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)
# transforming the data to include another axis
x = x[:, np.newaxis]
y = y[:, np.newaxis]
polynomial_features= PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
rmse = np.sqrt(mean_squared_error(y,y_poly_pred))
r2 = r2_score(y,y_poly_pred)
print(rmse)
print(r2)
plt.scatter(x, y, s=10)
# sort the values of x before line plot
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x,y_poly_pred), key=sort_axis)
x, y_poly_pred = zip(*sorted_zip)
plt.plot(x, y_poly_pred, color='m')
plt.show()


To generate polynomial features (here a 2nd-degree polynomial)
------------------------------------------------------------
polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x)

Explanation
-----------
Let's take the first three rows of x:

[[-3.29215704]
 [ 0.79952837]
 [-0.93621395]]

If we apply a polynomial transformation of degree 2, the feature vectors become

[[ 1.         -3.29215704 10.83829796]
 [ 1.          0.79952837  0.63924562]
 [ 1.         -0.93621395  0.87649656]]

The leading column of ones is the bias term, which PolynomialFeatures adds by default (include_bias=True). The remaining two columns are x and x².
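The transformation can be checked directly on those three rows. This is a small sketch; `x_head`, `with_bias` and `without_bias` are illustrative names, while `include_bias` is an actual parameter of `PolynomialFeatures`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# First three rows of x from the example above
x_head = np.array([[-3.29215704], [0.79952837], [-0.93621395]])

# Default behaviour: columns are [1, x, x^2]
with_bias = PolynomialFeatures(degree=2).fit_transform(x_head)

# include_bias=False drops the constant column: [x, x^2]
without_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x_head)

print(with_bias)
print(without_bias)
```

Because LinearRegression fits its own intercept by default, the constant column is redundant in this setting, but leaving it in does not change the predictions.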

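Fitting a linear regression model on the transformed features gives the plot below.

[Figure: the generated data with the fitted quadratic curve]

It is clear from the plot that the quadratic curve fits the data better than the straight line. Computing the RMSE and R² score gives: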

RMSE of polynomial regression is 10.120437473614711.

R2 of polynomial regression is 0.8537647164420812.

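We can see that, compared with the straight line, the RMSE has decreased and the R² score has increased.

If we fit a cubic curve (degree=3) to the dataset, we can see that it passes through more data points than the quadratic curve and the straight line.

[Figure: the generated data with the fitted cubic curve]

The metrics for the cubic curve are: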

RMSE is 3.449895507408725

R2 score is 0.9830071790386679
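The three fits can be compared in one loop. The sketch below reuses the data-generating code from above and collects training RMSE and R² per degree; the `results` dictionary is an illustrative helper, not part of the original article.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-3, 3, 20)
x = x[:, np.newaxis]

results = {}
for degree in (1, 2, 3):
    # Expand x into polynomial features, then fit an ordinary linear model
    x_poly = PolynomialFeatures(degree=degree).fit_transform(x)
    model = LinearRegression().fit(x_poly, y)
    y_pred = model.predict(x_poly)
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    r2 = r2_score(y, y_pred)
    results[degree] = (rmse, r2)
    print("degree={}: RMSE={:.3f}, R2={:.3f}".format(degree, rmse, r2))
```

Training error can only go down as the degree rises, because each higher-degree model contains the lower-degree one as a special case. That is why training metrics alone cannot tell us which degree to choose.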

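Below is a comparison of the linear, quadratic and cubic curves fitted to the dataset.

[Figure: comparison of the linear, quadratic and cubic fits]

If we increase the degree further to 20, we can see that the curve passes through even more data points. Below is a comparison of the curves for degree 3 and degree 20.

[Figure: comparison of the degree 3 and degree 20 fits]

For degree=20, the model also captures the noise in the data. This is an example of overfitting. Even though this model passes through most of the data points, it will fail to generalize to unseen data.

To prevent overfitting, we can add more training samples so that the algorithm does not learn the noise in the system and becomes more general. (Note: adding more data may not help if the data itself is noise.)

How do we choose the best model? To answer this question, we need to understand the bias vs. variance trade-off.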

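The Bias vs. Variance Trade-off

Bias refers to the error caused by the model's overly simple assumptions when fitting the data. High bias means the model is unable to capture the patterns in the data, which results in underfitting.

Variance refers to the error caused by an overly complex model trying to fit the data. High variance means the model passes through most of the data points, which results in overfitting.

The figure below summarizes what we have learned so far.

[Figure: underfitting, a good fit and overfitting compared]

As the figure below shows, bias decreases and variance increases as model complexity increases, and vice versa. Ideally, a machine learning model should have both low variance and low bias. In practice, it is not possible to minimize both at once. A trade-off is therefore needed to obtain a model that performs well on both the training data and unseen data.

[Figure: bias and variance as a function of model complexity]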
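One common way to act on this trade-off is to hold out part of the data and compare training error with validation error across degrees. The sketch below is an illustration under assumptions: it draws a larger, hypothetical sample (200 points) from the same cubic-plus-noise process used earlier, and the 70/30 split and the candidate degrees are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical larger sample from the same cubic-plus-noise process
rng = np.random.RandomState(0)
x = 2 - 3 * rng.normal(0, 1, 200)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + rng.normal(-3, 3, 200)
x = x[:, np.newaxis]

x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.3, random_state=0)

scores = {}
for degree in (1, 2, 3, 5, 8):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_train, y_train)
    rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
    rmse_val = np.sqrt(mean_squared_error(y_val, model.predict(x_val)))
    scores[degree] = (rmse_train, rmse_val)
    print("degree={}: train RMSE={:.3f}, validation RMSE={:.3f}".format(
        degree, rmse_train, rmse_val))

# Pick the degree with the lowest error on data the model has not seen
best_degree = min(scores, key=lambda d: scores[d][1])
print("Lowest validation RMSE at degree", best_degree)
```

A low-degree model typically shows high error on both sets (high bias), while a model that is too flexible shows low training error but a widening gap to validation error (high variance).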

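So far, we have covered most of the theory behind polynomial regression.

Applying Polynomial Regression to the Boston Housing Dataset

Import the required Python libraries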

import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd 
import seaborn as sns 
%matplotlib inline

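Load the data

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this snippet needs an older scikit-learn release.
from sklearn.datasets import load_boston
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['MEDV'] = boston_dataset.target

Data visualization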

# set the size of the figure
sns.set(rc={'figure.figsize':(11.7,8.27)})
# plot a histogram showing the distribution of the target values
sns.distplot(boston['MEDV'], bins=30)
plt.show()

[Figure: histogram of the MEDV target values]

Correlation matrix

# compute the pair wise correlation for all columns 
correlation_matrix = boston.corr().round(2)
# use the heatmap function from seaborn to plot the correlation matrix
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True)

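[Figure: heatmap of the correlation matrix]

Observations

From the correlation plot above, we can see that MEDV is strongly correlated with LSTAT and RM.
RAD and TAX are strongly correlated with each other, so to avoid multicollinearity we do not include them in the features.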
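Reading strongly correlated pairs off a heatmap by eye can be backed up with a few lines of pandas. The sketch below uses a small hypothetical DataFrame (columns `a`, `b`, `c`) rather than the Boston data, and the 0.8 threshold is an assumed cut-off, not a value from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: "b" is nearly a copy of "a", "c" is independent
rng = np.random.RandomState(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr().abs()
threshold = 0.8  # assumed cut-off for "strongly correlated"

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [
    (row, col, upper.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(pairs)
```

Applied to a correlation matrix like the one above, the same pattern would list candidate pairs to review for multicollinearity before choosing features.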

plt.figure(figsize=(20, 5))
features = ['LSTAT', 'RM']
target = boston['MEDV']
for i, col in enumerate(features):
 plt.subplot(1, len(features) , i+1)
 x = boston[col]
 y = target
 plt.scatter(x, y, marker='o')
 plt.title(col)
 plt.xlabel(col)
 plt.ylabel('MEDV')

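[Figure: scatter plots of MEDV against LSTAT and against RM]

We can see that the relationship between LSTAT and MEDV is not exactly linear. Let's apply a degree-2 polynomial regression and check the result.

Split the dataset into training and test sets

The Python code is as follows: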

from sklearn.model_selection import train_test_split
X = pd.DataFrame(np.c_[boston['LSTAT'], boston['RM']], columns = ['LSTAT','RM'])
Y = boston['MEDV']
# splits the training and test data set in 80% : 20%
# assign random_state to any value.This ensures consistency.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=5)

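Let's define a function that transforms the original features into polynomial features of a given degree and then applies linear regression to them.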

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

def create_polynomial_regression_model(degree):
    """Creates a polynomial regression model for the given degree"""
    poly_features = PolynomialFeatures(degree=degree)

    # transform the features to higher degree features
    X_train_poly = poly_features.fit_transform(X_train)

    # fit the transformed features to Linear Regression
    poly_model = LinearRegression()
    poly_model.fit(X_train_poly, Y_train)

    # predicting on the training data set
    y_train_predicted = poly_model.predict(X_train_poly)

    # predicting on the test data set; reuse the transformer fitted on the
    # training data (transform) instead of refitting it on the test data
    y_test_predict = poly_model.predict(poly_features.transform(X_test))

    # evaluating the model on the training data set
    rmse_train = np.sqrt(mean_squared_error(Y_train, y_train_predicted))
    r2_train = r2_score(Y_train, y_train_predicted)

    # evaluating the model on the test data set
    rmse_test = np.sqrt(mean_squared_error(Y_test, y_test_predict))
    r2_test = r2_score(Y_test, y_test_predict)

    print("The model performance for the training set")
    print("-------------------------------------------")
    print("RMSE of training set is {}".format(rmse_train))
    print("R2 score of training set is {}".format(r2_train))

    print("\n")

    print("The model performance for the test set")
    print("-------------------------------------------")
    print("RMSE of test set is {}".format(rmse_test))
    print("R2 score of test set is {}".format(r2_test))

Next, call the function with degree=2:

create_polynomial_regression_model(2)

The performance of the model using polynomial regression:

The model performance for the training set

-------------------------------------------

RMSE of training set is 4.703071027847756

R2 score of training set is 0.7425094297364765

The model performance for the test set

-------------------------------------------

RMSE of test set is 3.784819884545044

R2 score of test set is 0.8170372495892174
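The transform-then-fit steps can also be bundled into a scikit-learn pipeline, which fits the transformer on the training data only and reuses it at prediction time. The sketch below is an illustration under assumptions: since `load_boston` is unavailable in recent scikit-learn releases, it uses hypothetical two-feature synthetic data standing in for LSTAT and RM, so its numbers are not comparable to the ones above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical two-feature data standing in for LSTAT and RM
rng = np.random.RandomState(5)
X = rng.uniform(0, 10, size=(300, 2))
Y = (30 - 2.0 * X[:, 0] + 0.1 * X[:, 0] ** 2 + 1.5 * X[:, 1]
     + rng.normal(0, 1, 300))

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=5)

# The pipeline fits PolynomialFeatures on the training data only and
# reuses that fitted transformer when predicting on the test data
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, Y_train)

y_test_pred = model.predict(X_test)
rmse_test = np.sqrt(mean_squared_error(Y_test, y_test_pred))
r2_test = r2_score(Y_test, y_test_pred)
print("RMSE of test set is {}".format(rmse_test))
print("R2 score of test set is {}".format(r2_test))
```

With a pipeline there is no separate transformed matrix to keep track of, which makes it harder to accidentally fit a preprocessing step on test data.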

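Conclusion

In this machine learning series, we have covered linear regression and polynomial regression, and implemented both models on the Boston Housing dataset.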
