ValueError when fitting a model

Time: 2015-09-14 16:46:28

Tags: python numpy pandas statsmodels

I am running this code just to see how a linear regression model works in Python:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt  # needed for the plt.* calls at the end

train = pd.read_csv('data/train.csv', parse_dates=[0])
test = pd.read_csv('data/test.csv', parse_dates=[0])

print train.head()

#Feature engineering
temp_train = pd.DatetimeIndex(train['datetime'])
train['year'] = temp_train.year
train['month'] = temp_train.month
train['hour'] = temp_train.hour
train['weekday'] = temp_train.weekday

temp_test = pd.DatetimeIndex(test['datetime'])
test['year'] = temp_test.year
test['month'] = temp_test.month
test['hour'] = temp_test.hour
test['weekday'] = temp_test.weekday

#Define features vector
features = ['season', 'holiday', 'workingday', 'weather',
            'temp', 'atemp', 'humidity', 'windspeed', 'year',
            'month', 'weekday', 'hour']

#The evaluation metric is the RMSE in the log domain,
#so we should transform the target columns into log domain as well.
for col in ['casual', 'registered', 'count']:
    train['log-' + col] = train[col].apply(lambda x: np.log1p(x))

#Split train data set into training and validation sets
training, validation = train[:int(0.8*len(train))], train[int(0.8*len(train)):]

# Create a linear model
X = sm.add_constant(training[features])
model = sm.OLS(training['log-count'],X) # OLS stands for Ordinary Least Squares
f = model.fit()

ypred = f.predict(sm.add_constant(validation[features]))
print(ypred)

plt.figure()
# 'ro' draws the predictions in red so the plot matches the title
plt.plot(validation[features], ypred, 'ro', validation[features], validation['log-count'], 'b-')
plt.title('blue: true,   red: OLS')

The following error message pops up. What does it mean and how can I fix it?

Traceback (most recent call last):
  File "C:/TestModel/linear_regression.py", line 99, in <module>
    ypred = f.predict(sm.add_constant(validation[features]))
  File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 749, in predict
    return self.model.predict(self.params, exog, *args, **kwargs)
  File "C:\Python27\lib\site-packages\statsmodels\regression\linear_model.py", line 359, in predict
    return np.dot(exog, params)
ValueError: shapes (2178,12) and (13,) not aligned: 12 (dim 1) != 13 (dim 0)

Here is a sample of the data:

print training.head()
             datetime  season  holiday  workingday  weather  temp   atemp  \
0 2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1 2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2 2011-01-01 02:00:00       1        0           0        1  9.02  13.635   
3 2011-01-01 03:00:00       1        0           0        1  9.84  14.395   
4 2011-01-01 04:00:00       1        0           0        1  9.84  14.395   

   humidity  windspeed  casual  registered  count  year  month  hour  weekday  \
0        81          0       3          13     16  2011      1     0        5   
1        80          0       8          32     40  2011      1     1        5   
2        80          0       5          27     32  2011      1     2        5   
3        75          0       3          10     13  2011      1     3        5   
4        75          0       0           1      1  2011      1     4        5   

   log-casual  log-registered  log-count  
0    1.386294        2.639057   2.833213  
1    2.197225        3.496508   3.713572  
2    1.791759        3.332205   3.496508  
3    1.386294        2.397895   2.639057  
4    0.000000        0.693147   0.693147  


print validation.head()
                datetime  season  holiday  workingday  weather   temp   atemp  \
8708 2012-08-05 05:00:00       3        0           0        1  29.52  34.850   
8709 2012-08-05 06:00:00       3        0           0        1  29.52  34.850   
8710 2012-08-05 07:00:00       3        0           0        1  30.34  35.605   
8711 2012-08-05 08:00:00       3        0           0        1  31.16  36.365   
8712 2012-08-05 09:00:00       3        0           0        1  32.80  38.635   

      humidity  windspeed  casual  registered  count  year  month  hour  \
8708        74    16.9979       1          18     19  2012      8     5   
8709        79    16.9979       7          12     19  2012      8     6   
8710        74    19.9995      18          50     68  2012      8     7   
8711        66    22.0028      27          81    108  2012      8     8   
8712        59    23.9994      61         168    229  2012      8     9   

      weekday  log-casual  log-registered  log-count  
8708        6    0.693147        2.944439   2.995732  
8709        6    2.079442        2.564949   2.995732  
8710        6    2.944439        3.931826   4.234107  
8711        6    3.332205        4.406719   4.691348  
8712        6    4.127134        5.129899   5.438079  

1 Answer:

Answer 0: (score: 2)

This looks like a design problem with the add_constant function for this use case.

From the docstring:

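"For ndarrays and pandas.DataFrames, checks to make sure a constant is not already included. If there is at least one column of ones then the original object is returned."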

http://statsmodels.sourceforge.net/devel/_modules/statsmodels/tools/tools.html#add_constant

I think it was defined this way to avoid a singular design matrix during estimation, but predict would also work with singular matrices.

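My guess is that your validation data has a column in which every value is identical, e.g. the rows might all be from the same year. In that case add_constant treats that column as an existing constant and returns the data unchanged, with 12 columns instead of 13. If the constant column is intended, you need to add the constant to the DataFrame manually.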
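A minimal sketch of adding the constant by hand, using only pandas and NumPy. The small `validation` frame, the shortened `features` list and the `params` values are made-up stand-ins for illustration. In the question, the coefficients would come from `f.params`, and the constant has to sit in the same column position as in the training design matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the validation split: 'year' never changes,
# which is the situation guessed at above.
validation = pd.DataFrame({
    'temp': [29.52, 29.52, 30.34, 31.16, 32.80],
    'humidity': [74, 79, 74, 66, 59],
    'year': [2012, 2012, 2012, 2012, 2012],
})
features = ['temp', 'humidity', 'year']

# Add the intercept column explicitly instead of relying on add_constant.
X_val = validation[features].copy()
X_val.insert(0, 'const', 1.0)

# Placeholder coefficients (const + 3 features), purely illustrative.
params = np.array([0.5, 0.1, 0.01, 0.0])

# Same operation that failed in the traceback; the shapes now line up.
ypred = np.dot(X_val.values, params)
print(X_val.shape)  # (5, 4)
print(ypred.shape)  # (5,)
```

With the 12 features from the question, this would give a matrix with 13 columns, matching the 13 fitted parameters.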

It would be better if add_constant had an option to switch off this behavior.