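I am trying to build a multiple linear regression model with statsmodels. I want the model to include a constant, but it is not being added correctly. The same code worked on another, smaller dataset, but it does not work on my current dataset, which is roughly 1000 observations x 2000 variables.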
# Multiple Linear Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
# Importing the dataset, y value is last column, other columns are X
dataset = pd.read_excel('sheet.xlsx')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
#Stats
X2 = sm.add_constant(X)
test = sm.OLS(y, X2)
test2 = test.fit()
print(test2.summary())
My output looks like this:
/home/chasel88/.local/lib/python3.7/site-packages/statsmodels/regression/linear_model.py:1648: RuntimeWarning: divide by zero encountered in true_divide
return 1 - (np.divide(self.nobs - self.k_constant, self.df_resid)
/home/chasel88/.local/lib/python3.7/site-packages/statsmodels/regression/linear_model.py:1649: RuntimeWarning: invalid value encountered in double_scalars
* (1 - self.rsquared))
/home/chasel88/.local/lib/python3.7/site-packages/statsmodels/regression/linear_model.py:1665: RuntimeWarning: divide by zero encountered in double_scalars
return self.ssr/self.df_resid
/home/chasel88/.local/lib/python3.7/site-packages/statsmodels/regression/linear_model.py:1578: RuntimeWarning: divide by zero encountered in double_scalars
return np.dot(wresid, wresid) / self.df_resid
OLS Regression Results
==============================================================================
Dep. Variable: Reverse Log R-squared: 1.000
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: 0.000
Date: Di, 09 Jul 2019 Prob (F-statistic): nan
Time: 16:36:58 Log-Likelihood: 31546.
No. Observations: 1097 AIC: -6.090e+04
Df Residuals: 0 BIC: -5.541e+04
Df Model: 1096
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Measurement1 2.1712 inf 0 nan nan nan
Measurement2 -0.1254 inf -0 nan nan nan
Measurement3 -1.0199 inf -0 nan nan nan
Measurement4 2.4232 inf 0 nan nan nan
Measurement5 0.7925 inf 0 nan nan nan
Measurement6 -0.6553 inf -0 nan nan nan
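The output does not show the y-intercept, yet when I run the same data through sklearn I do get an intercept. I only want to use statsmodels so that I can get p-values for the coefficients. Besides the missing intercept, "nan" appears everywhere and the warnings report division-by-zero errors. Does anyone know what the problem is?
Answer 0 (score: 1)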
It would help to provide an MCVE for both the sklearn and statsmodels regressions.
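Setting aside the merits of running a regression with 2,000 variables, it looks like your input data may contain a constant column. The help page for sm.add_constant() states: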
has_constant : str {'raise', 'add', 'skip'} Behavior if ``data`` already has a constant. The default will return data without adding another constant. If 'raise', will raise an error if a constant is present. Using 'add' will duplicate the constant, if one is present.
np.random.seed(42)
df = pd.DataFrame({'x1':np.random.rand(20) // .1,
'x2':np.random.rand(20) // .01,
'x3':np.random.rand(20) // .01,
'y':np.random.rand(20) // .01})
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X2 = sm.add_constant(X)
model = sm.OLS(y, X2).fit()
print(model.summary())
returns
const 23.7669 24.751 0.960 0.351 -28.702 76.236
x1 1.1993 2.943 0.408 0.689 -5.039 7.438
x2 0.4973 0.327 1.523 0.147 -0.195 1.190
x3 -0.1122 0.231 -0.486 0.634 -0.602 0.377
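If the dataset already contains a constant column, sm.add_constant() runs without any message and does not add a constant. In the example below, the constant is a value other than 1, so the parameter reported for k in the regression output differs from the const parameter in the normal case above.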
np.random.seed(42)
df = pd.DataFrame({'x1':np.random.rand(20) // .1,
'x2':np.random.rand(20) // .01,
'x3':np.random.rand(20) // .01,
'k':list([15])*20,
'y':np.random.rand(20) // .01})
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X2 = sm.add_constant(X)
model = sm.OLS(y, X2).fit()
print(model.summary())
returns
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 1.1993 2.943 0.408 0.689 -5.039 7.438
x2 0.4973 0.327 1.523 0.147 -0.195 1.190
x3 -0.1122 0.231 -0.486 0.634 -0.602 0.377
k 1.5845 1.650 0.960 0.351 -1.913 5.082