Statsmodels (Python): predicting with a linear regression using one fewer predictor

Time: 2017-05-29 14:59:47

Tags: python regression statsmodels predict

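I trained a linear regression model with 20 predictors on a dataset covering one year. Below, x20 is a list of arrays, each array being one predictor to feed into the regression; y is the set of observations I am fitting to, and model is the resulting linear regression model. The observations and predictors are sliced to the training period, which is everything except the last day (24 hours) that I will verify or predict on: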

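import numpy as np
import statsmodels.api as sm

num_verifydays = 1  # number of days (24 rows each) held out for verification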
##############Train MOS model##################
x20=[predictor1[:-(num_verifydays)*24],predictor2[:-(num_verifydays)*24],
predictor3[:-(num_verifydays)*24],predictor4[:-(num_verifydays)*24],
predictor5[:-(num_verifydays)*24],predictor6[:-(num_verifydays)*24],
predictor7[:-(num_verifydays)*24],predictor8[:-(num_verifydays)*24],
predictor9[:-(num_verifydays)*24],predictor10[:-(num_verifydays)*24],
predictor11[:-(num_verifydays)*24],predictor12[:-(num_verifydays)*24],
predictor13[:-(num_verifydays)*24],predictor14[:-(num_verifydays)*24],
predictor15[:-(num_verifydays)*24],predictor16[:-(num_verifydays)*24],
predictor17[:-(num_verifydays)*24],predictor18[:-(num_verifydays)*24],
predictor19[:-(num_verifydays)*24],predictor20[:-(num_verifydays)*24]]

x20 = np.asarray(x20).T.tolist()

y = result_full['obs'][:-(num_verifydays)*24]

model = sm.OLS(y,x20, missing='drop').fit()

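I want to predict on my verification day with this model using all 20 predictors, and then using only 19 predictors, to see how much the forecast skill changes with fewer predictors. I tried setting predictor20 to an array of zeros in x19, as you can see below, but this seems to give me strange results: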

##################predict with regression model##################
x20=[predictor1[-(num_verifydays)*24:],predictor2[-(num_verifydays)*24:],
predictor3[-(num_verifydays)*24:],predictor4[-(num_verifydays)*24:],
predictor5[-(num_verifydays)*24:],predictor6[-(num_verifydays)*24:],
predictor7[-(num_verifydays)*24:],predictor8[-(num_verifydays)*24:],
predictor9[-(num_verifydays)*24:],predictor10[-(num_verifydays)*24:],
predictor11[-(num_verifydays)*24:],predictor12[-(num_verifydays)*24:],
predictor13[-(num_verifydays)*24:],predictor14[-(num_verifydays)*24:],
predictor15[-(num_verifydays)*24:],predictor16[-(num_verifydays)*24:],
predictor17[-(num_verifydays)*24:],predictor18[-(num_verifydays)*24:],
predictor19[-(num_verifydays)*24:],predictor20[-(num_verifydays)*24:]]

x19=[predictor1[-(num_verifydays)*24:],predictor2[-(num_verifydays)*24:],
predictor3[-(num_verifydays)*24:],predictor4[-(num_verifydays)*24:],
predictor5[-(num_verifydays)*24:],predictor6[-(num_verifydays)*24:],
predictor7[-(num_verifydays)*24:],predictor8[-(num_verifydays)*24:],
predictor9[-(num_verifydays)*24:],predictor10[-(num_verifydays)*24:],
predictor11[-(num_verifydays)*24:],predictor12[-(num_verifydays)*24:],
predictor13[-(num_verifydays)*24:],predictor14[-(num_verifydays)*24:],
predictor15[-(num_verifydays)*24:],predictor16[-(num_verifydays)*24:],
predictor17[-(num_verifydays)*24:],predictor18[-(num_verifydays)*24:],
predictor19[-(num_verifydays)*24:],np.zeros(num_verifydays*24)]

x20 = np.asarray(x20).T.tolist()
x19 = np.asarray(x19).T.tolist()

results20 = model.predict(x20)
results19 = model.predict(x19)

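1 Answer:

Answer 0 (score: 1)

You should fit two different models, one with 19 exogenous variables and another with 20. This is statistically sounder than testing the 20-variable model on a 19-variable set, because the fitted coefficients will differ: the coefficients of the 20-variable model were estimated with predictor20 present, so zeroing that column only removes its term and leaves the other 19 coefficients as they were.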

# x19 and x20 here must be the training-period matrices (19 and 20 columns, same rows as y)
model19 = sm.OLS(y,x19, missing='drop').fit()
model20 = sm.OLS(y,x20, missing='drop').fit()

What is the frequency of your data? A test set of only 1 day (n = 1) won't really tell you much about variable importance.

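Another way to look at this variable's importance is the incremental R-squared, i.e. how much R-squared increases or decreases between the two models.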

Also consider looking at sklearn's {{3}} functionality.