Question

I have a dataframe like this (the real one is larger and has more features):

        Date  Influenza[it]  Febbre[it]  Cefalea[it]  Paracetamolo[it]  \
0    2008-01            989        2395         1291              2933   
1    2008-02            962        2553         1360              2547   
2    2008-03           1029        2309         1401              2735   
3    2008-04           1031        2399         1137              2296   
       ...              ...

     tot_incidence  
0           4.56  
1           5.98  
2           6.54  
3           6.95  
            ....

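First, I ran an OLS regression on the whole dataframe, without splitting it into training and test sets. This is the working "input configuration" (tot_incidence is the predicted variable; Influenza[it], Febbre[it] and Cefalea[it] are the features):

fin1=fin1.rename(columns = {'tot_incidence':'A','Influenza[it]':'B', 'Febbre[it]':'C','Cefalea[it]':'D'})
result = sm.ols(formula="A ~ B + C + D", data=fin1).fit()

OK. Now I would like to make a training set and a test set.

I tried both a classic split and k-fold.

1° Classic split

This is probably the easier option. I can do this:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.3, random_state=1)

and then pass the variables to the OLS model:

x_train = sm.add_constant(X_train)
model = sm.OLS(y_train, x_train)
results = model.fit()
# the test matrix needs the same constant column as the training matrix
predictions = results.predict(sm.add_constant(X_test))

In this case, how can I pass x, y from the dataframe into the cross_validation.train_test_split function?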

2° K-fold (if it is too hard, don't waste time)

For example, I can do this:

[code block missing in the source]

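At this point I am stuck: how can I pass these variables to OLS to make the predictions? Is there a better way to build the training and test sets?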

Answer 1

In this case, how can I pass x, y from the dataframe into the cross_validation.train_test_split function?

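You need to convert the dataframe columns into inputs (x, y) that the algorithm can understand, i.e. turn the columns into numbers or categories, depending on the type of algorithm you are trying to run.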

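1) Select the variable in the dataframe that is the response (predicted) variable, i.e. the Y variable. Say that is Influenza:
y = df['Influenza[it]'].values # convert to a numpy array; bracket access is needed because of the [it] suffix

2) Select the X variables, say Febbre, Cefalea and Paracetamolo:
import numpy as np
X = np.column_stack([df['Febbre[it]'].values, df['Cefalea[it]'].values, df['Paracetamolo[it]'].values])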

Now you can call the cross_validation.train_test_split function.

Note that if your variables are categorical, you have to encode them first, for example with one-hot encoding.
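As a small illustration of the encoding step, here is a sketch using pandas.get_dummies. The "region" column is hypothetical and does not exist in the asker's data; it is only there to show what a categorical column turns into.

```python
import pandas as pd

# Hypothetical frame: "region" is a made-up categorical column for illustration.
df = pd.DataFrame({
    "Febbre[it]": [2395, 2553, 2309, 2399],
    "region": ["north", "south", "north", "centre"],
})

# One-hot encode the categorical column. drop_first=True drops one level
# so the dummy columns are not perfectly collinear with an intercept term.
encoded = pd.get_dummies(df, columns=["region"], drop_first=True, dtype=float)

X = encoded.values  # numeric array, ready for train_test_split
print(encoded.columns.tolist())
```

Dropping one level matters for OLS with a constant: with all dummy columns present, they sum to the intercept column and the design matrix becomes rank deficient.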

Cross-validation of a dataframe for an OLS regression model
