Data: https://courses.edx.org/c4x/MITx/15.071x_2/asset/NBA_train.csv
I know how to fit this data to a multiple linear regression model using statsmodels.formula.api:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.formula.api as smf
model = smf.ols(formula="W ~ PTS + oppPTS", data=NBA).fit()
model.summary()
However, I find this R-like formula notation awkward, and I would like to use the usual pandas syntax instead:
import pandas as pd
NBA = pd.read_csv("NBA_train.csv")
import statsmodels.api as sm
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
With the second approach I get the following error:
ValueError: shapes (835,2) and (835,2) not aligned: 2 (dim 1) != 835 (dim 0)
Why does this happen, and how can I fix it?
Answer 0 (score: 11)
In sm.OLS(y, X), y is the dependent variable and X holds the independent variables.
In the formula W ~ PTS + oppPTS, W is the dependent variable, and PTS and oppPTS are the independent variables.
Therefore, use
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
instead of
X = NBA['W']
y = NBA[['PTS', 'oppPTS']]
The complete corrected script:
import pandas as pd
import statsmodels.api as sm
NBA = pd.read_csv("NBA_train.csv")
y = NBA['W']
X = NBA[['PTS', 'oppPTS']]
X = sm.add_constant(X)
model11 = sm.OLS(y, X).fit()
model11.summary()
yields
                            OLS Regression Results
==============================================================================
Dep. Variable:                      W   R-squared:                       0.942
Model:                            OLS   Adj. R-squared:                  0.942
Method:                 Least Squares   F-statistic:                     6799.
Date:                Sat, 21 Mar 2015   Prob (F-statistic):               0.00
Time:                        14:58:05   Log-Likelihood:                -2118.0
No. Observations:                 835   AIC:                             4242.
Df Residuals:                     832   BIC:                             4256.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         41.3048      1.610     25.652      0.000        38.144    44.465
PTS            0.0326      0.000    109.600      0.000         0.032     0.033
oppPTS        -0.0326      0.000   -110.951      0.000        -0.033    -0.032
==============================================================================
Omnibus:                        1.026   Durbin-Watson:                   2.238
Prob(Omnibus):                  0.599   Jarque-Bera (JB):                0.984
Skew:                           0.084   Prob(JB):                        0.612
Kurtosis:                       3.009   Cond. No.                     1.80e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.