Computing the coefficients in a multiple linear regression

Asked: 2016-07-07 16:25:57

Tags: python numpy linear-regression statsmodels

I am trying to compute the coefficients of a multiple linear regression, using the statsmodels library. The problem is that with this code I get the error ValueError: endog and exog matrices are different sizes. I get this error because, in this example, the y set has 4 elements, while the X set is a list containing 7 ndarrays, each with 5 elements.

What I don't understand, though, is that the x set (not X) is a list containing 4 lists (y has 4 elements), where each inner list consists of 7 variables. To me, x and y have the same number of elements.

How can I fix this error?

import numpy as np
import statsmodels.api as sm

def test_linear_regression():
    x = [[0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
         [0.0, 1102259506.0, 44049537.0, 9.0, 2.0, 32000.0, 49222464.0],
         [0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
         [0.0, 1102259506.0, 44049537.0, 10.0, 2.0, 32000.0, 49222464.0]]

    y = [71.7554421425, 37.5205008984, 44.9945571423, 53.5441429615]
    reg_m(y, x)

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    y.append(1)
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results


if __name__ == "__main__":
    test_linear_regression()
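The mismatch becomes visible if the stacking in reg_m is replayed step by step. The following plain-NumPy sketch rebuilds what the code above actually constructs (add_constant only appends one more column of ones and does not change the row count):

```python
import numpy as np

x = [[0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
     [0.0, 1102259506.0, 44049537.0, 9.0, 2.0, 32000.0, 49222464.0],
     [0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
     [0.0, 1102259506.0, 44049537.0, 10.0, 2.0, 32000.0, 49222464.0]]
y = [71.7554421425, 37.5205008984, 44.9945571423, 53.5441429615]

# reg_m treats each inner list of x as a *column* of 7 values, so every
# np.column_stack call produces a matrix with 7 rows -- one per element
# of x[0] -- rather than 4 rows, one per observation.
ones = np.ones(len(x[0]))              # 7 ones
X = np.column_stack((x[0], ones))      # shape (7, 2)
for ele in x[1:]:
    X = np.column_stack((ele, X))      # grows to (7, 5)

print(X.shape)   # (7, 5): 7 rows in X, but y has only 4 values
                 # (5 after the stray y.append(1)), hence the ValueError
```

So X is built transposed relative to what OLS expects: rows must be observations, matching the length of y.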

1 Answer:

Answer 0 (score: 1)

Assuming that each list in x corresponds to one value of y:

x = [[0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
     [0.0, 1102259506.0, 44049537.0, 9.0, 2.0, 32000.0, 49222464.0],
     [0.0, 1102249463.0, 44055788.0, 9.0, 2.0, 32000.0, 49222464.0],
     [0.0, 1102259506.0, 44049537.0, 10.0, 2.0, 32000.0, 49222464.0]
     ]

y = [71.7554421425, 37.5205008984, 44.9945571423, 53.5441429615]

def reg_m(x, y):
  x = np.array(x)
  y = np.array(y)

  # add a column of ones for the y-intercept
  X = np.insert(x, 0, np.ones((1,)), axis=1)

  # or, if you really want to use add_constant to add the ones:
  # X = sm.add_constant(x, has_constant='add')

  return sm.OLS(y, X).fit()

model = reg_m(x, y)

To see a summary printout of the model, just call model.summary():

"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.450
Model:                            OLS   Adj. R-squared:                 -0.649
Method:                 Least Squares   F-statistic:                    0.4096
Date:                Thu, 07 Jul 2016   Prob (F-statistic):              0.741
Time:                        21:50:12   Log-Likelihood:                -14.665
No. Observations:                   4   AIC:                             35.33
Df Residuals:                       1   BIC:                             33.49
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const      -1.306e-07   2.18e-07     -0.599      0.657      -2.9e-06  2.64e-06
x1         -3.086e-11   5.15e-11     -0.599      0.657     -6.86e-10  6.24e-10
x2            -0.0001      0.000     -0.900      0.534        -0.002     0.002
x3             0.0031      0.003      0.900      0.534        -0.041     0.047
x4            16.0236     26.761      0.599      0.657      -324.006   356.053
x5          8.321e-12   9.25e-12      0.900      0.534     -1.09e-10  1.26e-10
x6          1.331e-07   1.48e-07      0.900      0.534     -1.75e-06  2.01e-06
x7             0.0002      0.000      0.900      0.534        -0.003     0.003
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.500
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.167
Skew:                          -0.000   Prob(JB):                        0.920
Kurtosis:                       2.000   Cond. No.                          inf
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
[3] The smallest eigenvalue is      0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""