Chapter 3 Linear Regression Models¶

3.1 Simple Linear Regression¶

Linear regression models

  • Use linear fitting to uncover the patterns behind the data
  • First, build a linear regression model to find the trend line behind the scattered sample points
  • Then, use the fitted regression line for simple predictive analysis or causal analysis

3.1.1 Basic Mathematical Principles of Simple Linear Regression¶

The univariate linear regression model (also called the simple linear regression model) can be written as

  • $$y = ax + b$$

    where $y$ is the dependent variable, $x$ is the independent variable, $a$ is the regression coefficient (slope), and $b$ is the intercept

In machine learning, the residual sum of squares is also called the model's loss function (a minimal computational sketch follows this list)

  • $$\sum(y^{(i)} - \hat{y}^{(i)})^2$$

    or

    $$\sum(y^{(i)} - (ax^{(i)} + b))^2$$

    where $y^{(i)}$ is the actual value and $\hat{y}^{(i)}$ is the predicted value
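
As a minimal sketch (not from the book), the loss function can be minimized directly with the standard least-squares formulas; the sample data below are made up for illustration:

import numpy as np

x = np.array([1, 2, 4, 5])  # illustrative independent variable values
y = np.array([2, 4, 6, 8])  # illustrative dependent variable values

# closed-form least-squares estimates for y = a*x + b
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

loss = np.sum((y - (a * x + b)) ** 2)  # residual sum of squares at the optimum
print(a, b, loss)  # 1.4 0.8 0.4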

3.1.2 Implementing Simple Linear Regression in Code¶

In [1]:
import warnings 

warnings.filterwarnings('ignore')
In [2]:
from sklearn.linear_model import LinearRegression 

regr = LinearRegression()  # create an (untrained) linear regression model named regr

X = [[1], [2], [4], [5]]  # feature values: sklearn expects a 2D structure, one inner list per sample

Y = [2, 4, 6, 8] 

regr.fit(X, Y)  # fit() trains the model; regr is now a fitted linear regression model

y = regr.predict([[1.5]])

print(y)

y = regr.predict([[1.5], [2.5], [4.5]])

print(y)
[2.9]
[2.9 4.3 7.1]
In [3]:
import matplotlib.pyplot as plt

plt.scatter(X, Y) 
plt.plot(X, regr.predict(X)) 
plt.show()

# the coef_ and intercept_ attributes give the slope and intercept of the fitted trend line

print('系数a:' + str(regr.coef_[0])) 
print('截距b:' + str(regr.intercept_))
[Figure: scatter plot of the sample points X, Y with the fitted regression line]
系数a:1.4000000000000004
截距b:0.7999999999999989
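
As a quick sanity check on the earlier prediction: plugging x = 1.5 into the fitted line gives 1.4 × 1.5 + 0.8 = 2.9, which matches the output of regr.predict([[1.5]]).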

3.1.3 Case Study: A Linear Regression Model of Years of Experience and Salary Across Industries¶

In [4]:
import pandas as pd
df = pd.read_excel('IT行业收入表.xlsx')
df.head()
Out[4]:
工龄 薪水
0 0.0 10808
1 0.1 13611
2 0.2 12306
3 0.3 12151
4 0.3 13057
In [5]:
# Years of experience (工龄) is the independent variable and salary (薪水) the dependent variable; select them as follows
X = df[['工龄']]
Y = df['薪水']
In [6]:
# Plot a scatter chart of the data:
from matplotlib import pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # so the Chinese axis labels display correctly
plt.scatter(X,Y)
plt.xlabel('工龄')
plt.ylabel('薪水')
plt.show()
[Figure: scatter plot of 工龄 (years of experience) vs 薪水 (salary)]
In [7]:
from sklearn.linear_model import LinearRegression
regr = LinearRegression()  # create the model
regr.fit(X,Y)  # train the model
Out[7]:
LinearRegression()
In [8]:
plt.scatter(X,Y)
plt.plot(X, regr.predict(X), color='red')  # draw the fitted line in red
plt.xlabel('工龄')
plt.ylabel('薪水')
plt.show()
[Figure: scatter plot of 工龄 vs 薪水 with the fitted line in red]
In [9]:
print('系数a为:' + str(regr.coef_[0]))
print('截距b为:' + str(regr.intercept_))
系数a为:2497.1513476046866
截距b为:10143.131966873787

Additional Topic: Model Improvement¶

One way to improve the fit is a univariate polynomial regression model, for example the univariate quadratic model (still linear in its coefficients)

  • $$y = ax^2 + bx + c$$
In [10]:
# Generate degree-2 (quadratic) features:
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_ = poly_reg.fit_transform(X)
In [11]:
print(X_[0:5])
[[1.   0.   0.  ]
 [1.   0.1  0.01]
 [1.   0.2  0.04]
 [1.   0.3  0.09]
 [1.   0.3  0.09]]
In [12]:
# Train the model
regr = LinearRegression()
regr.fit(X_, Y)
Out[12]:
LinearRegression()
In [13]:
# Visualization
plt.scatter(X,Y)
plt.plot(X, regr.predict(X_), color='red')
plt.show()
[Figure: scatter plot of 工龄 vs 薪水 with the fitted quadratic curve in red]
In [14]:
# Print the coefficients and the constant term
print(regr.coef_)  # coefficients, ordered as [0 for the constant column, b for x, a for x^2]
print(regr.intercept_)  # constant term c
[   0.         -743.68080444  400.80398224]
13988.159332096886
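
As a small usage sketch, the fitted quadratic y = ax² + bx + c can be read off from these attributes and used for prediction; PolynomialFeatures orders the columns as [1, x, x²], and the value 8.0 below is a made-up example, not a data point from the book:

c = regr.intercept_   # constant term c
b = regr.coef_[1]     # coefficient of x
a = regr.coef_[2]     # coefficient of x^2
print(f'y = {a:.2f}*x^2 + {b:.2f}*x + {c:.2f}')

# predict the salary for a hypothetical 8.0 years of experience
print(regr.predict(poly_reg.transform([[8.0]])))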

3.2 Evaluating the Linear Regression Model¶

  • Measuring how well the line fits the data

    • R-squared

    • Adj. R-squared

  • Measuring the significance of each feature variable

    • P-value

3.2.1 Implementing Model Evaluation in Code¶

The mathematics behind model evaluation is fairly involved, so this section focuses on the practical application

In [15]:
# 1. Read the data
import pandas
df = pandas.read_excel('IT行业收入表.xlsx')
X = df[['工龄']]
Y = df['薪水']

# 2. Train the model
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)

# 3. Visualize the model
from matplotlib import pyplot as plt
plt.scatter(X,Y)
plt.plot(X, regr.predict(X), color='red')  # draw the fitted line in red
plt.xlabel('工龄')
plt.ylabel('薪水')
plt.show()

# 4. Construct the regression equation
print('系数a为:' + str(regr.coef_[0]))
print('截距b为:' + str(regr.intercept_))
[Figure: scatter plot of 工龄 vs 薪水 with the fitted line in red]
系数a为:2497.1513476046866
截距b为:10143.131966873787
In [16]:
import statsmodels.api as sm
X2 = sm.add_constant(X)
est = sm.OLS(Y, X2).fit()
est.summary()  # outside Jupyter Notebook, use print(est.summary()) instead
Out[16]:
OLS Regression Results
Dep. Variable: 薪水 R-squared: 0.855
Model: OLS Adj. R-squared: 0.854
Method: Least Squares F-statistic: 578.5
Date: Fri, 14 Mar 2025 Prob (F-statistic): 6.69e-43
Time: 19:56:30 Log-Likelihood: -930.83
No. Observations: 100 AIC: 1866.
Df Residuals: 98 BIC: 1871.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 1.014e+04 507.633 19.981 0.000 9135.751 1.12e+04
工龄 2497.1513 103.823 24.052 0.000 2291.118 2703.185
Omnibus: 0.287 Durbin-Watson: 0.555
Prob(Omnibus): 0.867 Jarque-Bera (JB): 0.463
Skew: 0.007 Prob(JB): 0.793
Kurtosis: 2.667 Cond. No. 9.49


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
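
The key numbers in the summary table can also be pulled out of the fitted results object directly; a minimal sketch using standard statsmodels attributes:

print(est.rsquared)      # R-squared, about 0.855 here
print(est.rsquared_adj)  # Adj. R-squared, about 0.854
print(est.pvalues)       # P-values of const and 工龄
print(est.params)        # fitted intercept and coefficient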

3.2.2 Mathematical Principles of Model Evaluation¶

R-squared

$$R^2 = 1 - \frac{\sum(y^{(i)} - \hat{y}^{(i)})^2}{\sum(y^{(i)} - \bar{y})^2}$$

where $\hat{y}^{(i)}$ is the predicted value and $\bar{y}$ is the mean of the actual values; the closer $R^2$ is to 1, the better the fit

Adj. R-squared

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where $n$ is the number of samples and $k$ is the number of feature variables; the larger $k$ is, the stronger the downward adjustment of $R^2_{adj}$, so do not add excessive feature variables just to chase a high $R^2$
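
A minimal sketch of both formulas applied to the simple regression fitted in section 3.2.1 (reusing regr, X and Y from that section):

import numpy as np

y_pred = regr.predict(X)                       # predicted values
ss_res = np.sum((Y - y_pred) ** 2)             # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)           # total sum of squares
r2 = 1 - ss_res / ss_tot                       # R-squared, about 0.855

n, k = len(Y), X.shape[1]                      # n = 100 samples, k = 1 feature
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # Adj. R-squared, about 0.854
print(r2, r2_adj)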

3.3 Multiple Linear Regression¶

The principles of multiple linear regression are essentially the same as those of simple linear regression

3.3.1 Mathematical Principles and Code Implementation of Multiple Linear Regression¶

The optimization problem: find the coefficient vector that minimizes the residual sum of squares

$$ \underset{\overrightarrow{\beta}}{\min} (\overrightarrow{y} - X\overrightarrow{\beta})^T (\overrightarrow{y} - X\overrightarrow{\beta}) $$

where $\overrightarrow{y}$ is a column vector of length $n$ (the number of samples), $\overrightarrow{\beta}$ is a column vector of length $k$ (the number of features), and $X$ is the $n \times k$ feature matrix

Setting the derivative of the objective with respect to $\overrightarrow{\beta}$ to zero and applying the chain rule (using $\partial(\overrightarrow{y} - X\overrightarrow{\beta})/\partial\overrightarrow{\beta} = -X^T$):

$$ \begin{aligned} \frac{ \partial (\overrightarrow{y} - X\overrightarrow{\beta})^T (\overrightarrow{y} - X\overrightarrow{\beta})}{\partial \overrightarrow{\beta}} &= 0 \\ \frac{\partial (\overrightarrow{y} - X\overrightarrow{\beta})}{\partial \overrightarrow{\beta}} \cdot \frac{ \partial (\overrightarrow{y} - X\overrightarrow{\beta})^T (\overrightarrow{y} - X\overrightarrow{\beta})}{\partial (\overrightarrow{y} - X\overrightarrow{\beta})} &= 0 \\ -X^T \cdot 2 (\overrightarrow{y} - X\overrightarrow{\beta}) &= 0 \\ X^T(\overrightarrow{y} - X\overrightarrow{\beta}) &= 0 \\ X^T\overrightarrow{y} - X^TX\overrightarrow{\beta} &= 0 \\ X^TX\overrightarrow{\beta} &= X^T\overrightarrow{y} \\ \overrightarrow{\beta} &= (X^TX)^{-1}X^T\overrightarrow{y} \end{aligned} $$
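
A minimal numerical sketch of this closed-form solution, reusing the 工龄/薪水 data from section 3.2.1 (a column of ones is prepended so the intercept is estimated as well):

import numpy as np

X_mat = np.column_stack([np.ones(len(X)), X['工龄']])  # design matrix with a constant column
y_vec = Y.values

beta = np.linalg.inv(X_mat.T @ X_mat) @ X_mat.T @ y_vec  # beta = (X^T X)^{-1} X^T y
print(beta)  # approximately [10143.13, 2497.15], matching the intercept and coefficient found earlier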

3.3.2 Case Study: A Customer Value Prediction Model¶

In [17]:
import pandas as pd
df = pd.read_excel('客户价值数据表.xlsx')
df.head()
Out[17]:
客户价值 历史贷款金额 贷款次数 学历 月收入 性别
0 1150 6488 2 2 9567 1
1 1157 5194 4 2 10767 0
2 1163 7066 3 2 9317 0
3 983 3550 3 2 10517 0
4 1205 7847 3 3 11267 1
In [18]:
X = df[['历史贷款金额', '贷款次数', '学历', '月收入', '性别']]
Y = df['客户价值']
In [19]:
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(X,Y)
Out[19]:
LinearRegression()
In [20]:
print('各系数为:' + str(regr.coef_))
print('常数项系数k0为:' + str(regr.intercept_))
各系数为:[5.71421731e-02 9.61723492e+01 1.13452022e+02 5.61326459e-02
 1.97874093e+00]
常数项系数k0为:-208.42004079958383

Here regr.coef_ returns a list of coefficients, one for each feature variable, i.e. k1, k2, k3, k4 and k5, so the fitted multiple linear regression equation is (a usage sketch follows below):

$y = -208 + 0.057x_1 + 96x_2 + 113x_3 + 0.056x_4 + 1.97x_5$
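
A quick usage sketch: once the model is fitted, regr.predict can estimate the value of a new customer. The feature values below are made up purely for illustration:

# hypothetical new customer: loan history 8000, 2 loans, education level 3, monthly income 10000, male
new_customer = pd.DataFrame([[8000, 2, 3, 10000, 1]],
                            columns=['历史贷款金额', '贷款次数', '学历', '月收入', '性别'])
print(regr.predict(new_customer))  # predicted 客户价值 for this hypothetical customer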

In [21]:
import statsmodels.api as sm  # library used for regression model evaluation
X2 = sm.add_constant(X)
est = sm.OLS(Y, X2).fit()
est.summary()
Out[21]:
OLS Regression Results
Dep. Variable: 客户价值 R-squared: 0.571
Model: OLS Adj. R-squared: 0.553
Method: Least Squares F-statistic: 32.44
Date: Fri, 14 Mar 2025 Prob (F-statistic): 6.41e-21
Time: 19:56:30 Log-Likelihood: -843.50
No. Observations: 128 AIC: 1699.
Df Residuals: 122 BIC: 1716.
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -208.4200 163.810 -1.272 0.206 -532.699 115.859
历史贷款金额 0.0571 0.010 5.945 0.000 0.038 0.076
贷款次数 96.1723 25.962 3.704 0.000 44.778 147.567
学历 113.4520 37.909 2.993 0.003 38.406 188.498
月收入 0.0561 0.019 2.941 0.004 0.018 0.094
性别 1.9787 32.286 0.061 0.951 -61.934 65.891
Omnibus: 1.597 Durbin-Watson: 2.155
Prob(Omnibus): 0.450 Jarque-Bera (JB): 1.538
Skew: 0.264 Prob(JB): 0.464
Kurtosis: 2.900 Cond. No. 1.28e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.28e+05. This might indicate that there are
strong multicollinearity or other numerical problems.