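Statsmodels OLS function with dummy variables in Python

Time: 2018-12-09 05:42:09

Tags: python regression linear-regression

I am trying to build a regression with categorical variables.

I first create all the dummy variables, then drop every column I don't need from the x values.
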
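import numpy as np
import pandas as pd
import statsmodels.api as sm

# One-hot encode the office name for each year and append the dummy columns
d1 = pd.get_dummies(df2015["CBSA Office"])
df2015_new = pd.concat([df2015, d1], axis=1)
d2 = pd.get_dummies(df2016["CBSA Office"])
df2016_new = pd.concat([df2016, d2], axis=1)
# Stack both years and drop rows with missing values
trainset = pd.concat([df2015_new, df2016_new], axis=0)
trainset = trainset.dropna()
# Features: everything except the identifier columns and the two flow columns
x_train = trainset.drop(['CBSA Office','Location','Updated','Commercial Flow','Travellers Flow'], axis="columns")
y_train = trainset["Travellers Flow"]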
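
One thing that may be worth checking at this step: if the 2015 and 2016 data do not list exactly the same offices, encoding each year separately gives the two frames different dummy columns, `pd.concat` fills the gaps with NaN, and `dropna()` then removes those rows. A minimal sketch of encoding once after concatenating instead. The toy frames, the office names `A`, `B`, `C`, and the flow values are made up for illustration and are not the real data:

```python
import pandas as pd

# Toy stand-ins for df2015 / df2016; the values are invented for illustration.
df2015 = pd.DataFrame({"CBSA Office": ["A", "B"], "Travellers Flow": [10, 20]})
df2016 = pd.DataFrame({"CBSA Office": ["A", "C"], "Travellers Flow": [12, 30]})

# Per-year encoding: the dummy columns differ between years, so concat adds NaN.
per_year = pd.concat(
    [
        pd.concat([df2015, pd.get_dummies(df2015["CBSA Office"])], axis=1),
        pd.concat([df2016, pd.get_dummies(df2016["CBSA Office"])], axis=1),
    ],
    axis=0,
)
rows_left = len(per_year.dropna())  # rows that survive dropna()

# Encoding once, after concatenation: one consistent set of columns, no NaN.
combined = pd.concat([df2015, df2016], axis=0, ignore_index=True)
x_train = pd.get_dummies(combined["CBSA Office"], dtype=float)
y_train = combined["Travellers Flow"].astype(float)

print(rows_left, x_train.shape)
```

In this toy case every row has a NaN in some dummy column under the per-year approach, so `dropna()` would discard all of them.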

Now I run the regression using the OLS function.

x_train = x_train.iloc[:100].values.reshape(-1,1)
y_train = y_train.iloc[:100].values.reshape(-1,1)
modelx = sm.OLS(y_train.astype(float), x_train.astype(float)).fit()
modelx.summary()

Then I get an error message:

endog and exog matrices are different sizes

But I thought I had set them to the same size.
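
A likely cause is the reshape itself. `reshape(-1, 1)` flattens the whole 2-D feature matrix into a single column, so x ends up with rows × columns entries while y still has one entry per row. A small NumPy sketch of the shapes involved. The 26 columns match the number of offices in the summary output in this post, and the zero-filled arrays are placeholders, not the real data:

```python
import numpy as np

n_rows, n_features = 100, 26  # 26 = number of office dummies in the summary

X = np.zeros((n_rows, n_features))  # placeholder for x_train.iloc[:100].values
y = np.zeros(n_rows)                # placeholder for y_train.iloc[:100].values

X_flat = X.reshape(-1, 1)  # flattens every cell into one long column
y_col = y.reshape(-1, 1)   # y is 1-D, so this only adds a second axis

print(X.shape, X_flat.shape, y_col.shape)
```

The row counts of `X_flat` and `y_col` no longer agree, which is what the "different sizes" message complains about. Passing the 2-D matrix without reshaping keeps one row per observation.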

If I don't reshape them, I get this result:


C:\Users\CiCi\Anaconda3-1\lib\site-packages\statsmodels\regression\linear_model.py:1554: RuntimeWarning: invalid value encountered in double_scalars
  return self.ess/self.df_model
C:\Users\CiCi\Anaconda3-1\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
C:\Users\CiCi\Anaconda3-1\lib\site-packages\scipy\stats\_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
C:\Users\CiCi\Anaconda3-1\lib\site-packages\scipy\stats\_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
C:\Users\CiCi\Anaconda3-1\lib\site-packages\statsmodels\base\model.py:1100: RuntimeWarning: invalid value encountered in true_divide
  return self.params / self.bse
                          OLS Regression Results
Dep. Variable:      Travellers Flow     R-squared:             0.000
Model:              OLS                 Adj. R-squared:        0.000
Method:             Least Squares       F-statistic:           nan
Date:               Sun, 09 Dec 2018    Prob (F-statistic):    nan
Time:               00:34:01            Log-Likelihood:        -429.08
No. Observations:   100                 AIC:                   860.2
Df Residuals:       99                  BIC:                   862.8
Df Model:           0
Covariance Type:    nonrobust

                            coef     std err   t       P>|t|   [0.025   0.975]
Abbotsford-Huntingdon       8.5000   1.776     4.786   0.000   4.976    12.024
Aldergrove                  0        0         nan     nan     0        0
Ambassador Bridge           0        0         nan     nan     0        0
Blue Water Bridge           0        0         nan     nan     0        0
Boundary Bay                0        0         nan     nan     0        0
Cornwall                    0        0         nan     nan     0        0
Coutts                      0        0         nan     nan     0        0
Douglas (Peace Arch)        0        0         nan     nan     0        0
Edmundston                  0        0         nan     nan     0        0
Emerson                     0        0         nan     nan     0        0
Fort Frances Bridge         0        0         nan     nan     0        0
North Portal                0        0         nan     nan     0        0
Pacific Highway             0        0         nan     nan     0        0
Peace Bridge                0        0         nan     nan     0        0
Prescott                    0        0         nan     nan     0        0
Queenston-Lewiston Bridge   0        0         nan     nan     0        0
Rainbow Bridge              0        0         nan     nan     0        0
Sault Ste. Marie            0        0         nan     nan     0        0
St-Armand/Philipsburg       0        0         nan     nan     0        0
St-Bernard-de-Lacolle       0        0         nan     nan     0        0
St. Stephen                 0        0         nan     nan     0        0
St. Stephen 3rd Bridge      0        0         nan     nan     0        0
Stanstead                   0        0         nan     nan     0        0
Thousand Islands Bridge     0        0         nan     nan     0        0
Windsor and Detroit Tunnel  0        0         nan     nan     0        0
Woodstock Road              0        0         nan     nan     0        0

Omnibus:            81.245              Durbin-Watson:         0.324
Prob(Omnibus):      0.000               Jarque-Bera (JB):      453.220
Skew:               2.832               Prob(JB):              3.84e-99
Kurtosis:           11.757              Cond. No.              1.00e+16


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 9.98e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

This is the format I want, with all the dummy variables, but there are a lot of warnings, R^2 is 0, and I certainly can't make predictions from this.

What I want is a summary that includes every dummy variable.

I tried doing this:

x_train = np.array(x_train).reshape(1,-1)
y_train = np.array(y_train).reshape(1,-1)
modelx = sm.OLS(y_train.astype(float), x_train.astype(float)).fit()
modelx.summary()

and I get:

MemoryError                               Traceback (most recent call last)
<ipython-input-668-312de7f7e808> in <module>()
      1 x_train = np.array(x_train).reshape(1,-1)
      2 y_train = np.array(y_train).reshape(1,-1)
----> 3 modelx = sm.OLS(y_train.astype(float), x_train.astype(float)).fit()
      4 modelx.summary()

~\Anaconda3-1\lib\site-packages\statsmodels\regression\linear_model.py in fit(self, method, cov_type, cov_kwds, use_t, **kwargs)
    273                 self.pinv_wexog, singular_values = pinv_extended(self.wexog)
    274                 self.normalized_cov_params = np.dot(
--> 275                     self.pinv_wexog, np.transpose(self.pinv_wexog))
    276 
    277                 # Cache these singular values for use later.

MemoryError: 
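
One possible reading of this error: `reshape(1, -1)` turns the data into a single row, so the model sees one observation with rows × columns regressors. The traceback stops where statsmodels builds `normalized_cov_params`, a square matrix with one row and one column per regressor. A rough size estimate, assuming a hypothetical 10,000-row training set (that row count is not taken from the post) and the 26 dummy columns from the summary:

```python
import numpy as np

n_rows = 10_000   # hypothetical training-set size, not taken from the post
n_features = 26   # number of office dummies shown in the summary

X = np.zeros((n_rows, n_features))
X_row = X.reshape(1, -1)  # one "observation" with n_rows * n_features regressors

n_params = X_row.shape[1]
# The parameter covariance matrix holds n_params x n_params float64 values.
cov_bytes = n_params * n_params * 8

print(X_row.shape, f"{cov_bytes / 1e9:.1f} GB")
```

Under these assumptions the covariance matrix alone would need hundreds of gigabytes, which would explain a MemoryError.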

I'm new to Python and need a lot of help. Thanks!

0 answers:

No answers yet