How to test a cross-validated sklearn.linear_model trained with TimeSeriesSplit

Time: 2020-05-27 22:13:43

Tags: python numpy machine-learning scikit-learn time-series

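I can't get my head around how to test a model that was trained in a time-series fashion. In my case, I have data consisting of one integer value per week, and each value should be classified as 0 or 1. Training the model is not a problem (at least I think so), but I am struggling to actually test it. Here are some code snippets: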

#Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score
#Training
validation_size = test_size = 0.2

#data (weekly int value) has shape (1, 1000) and target (0 or 1) has shape (, 1000)
#and are passed to the function
#(note: scikit-learn expects X as (n_samples, n_features) and y as (n_samples,))
#note: train_test_split shuffles the rows by default; shuffle = False keeps them in time order
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size = test_size)
#n_splits depends on X_train, so it is computed after the split
n_splits = int(np.ceil(validation_size * np.size(X_train)))
tscv = TimeSeriesSplit(n_splits)
clf = LogisticRegressionCV(cv = tscv, random_state = 0, n_jobs = -1).fit(X_train, y_train)

#Testing ----- here I am not sure
y_score = clf.decision_function(X_test)
predicted_probs = clf.predict_proba(X_test)
positives = predicted_probs[:, 1]
#named auc_score so it does not shadow the imported sklearn.metrics.auc
auc_score = roc_auc_score(y_test, positives)
fpr, tpr, _ = roc_curve(y_test, positives)

#'.' is a marker, not a valid line style
plt.plot(fpr, tpr, marker = '.', label = 'LogRegCV, AUC:' + str(auc_score))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
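
For reference, here is a minimal sketch of what TimeSeriesSplit hands to LogisticRegressionCV as cv folds. The ten-sample array and n_splits = 3 are illustrative assumptions, not the data from the question. Each validation fold lies strictly after its training fold, and the training window grows from split to split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in: 10 time-ordered samples with a single feature (assumed values)
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

# Collect the index pairs so they can be inspected or reused
splits = [(train_idx.tolist(), val_idx.tolist())
          for train_idx, val_idx in tscv.split(X)]

for train_idx, val_idx in splits:
    # Every validation index is later in time than every training index
    print("train:", train_idx, "validate:", val_idx)
```

Note that this ordering only reflects time if the rows passed in are themselves in chronological order.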

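Shouldn't the testing also be done in chronological order? If so, what is the most convenient way to do it? I couldn't find anything. I have tried creating an array of arrays containing all possible intervals from the first day to the last and passing it to clf.decision_function(X_test_interval) and predicted_probs = clf.predict_proba(X_test). But evidently I am doing something wrong, or sklearn cannot handle this kind of data, because I get the following error: setting an array element with a sequence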
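
One possible way to keep the test chronological is sketched below, under stated assumptions: the data is synthetic (a hypothetical stand-in for the weekly values), n_splits = 5 and the four test blocks are arbitrary choices, and the model is fitted once and not refitted between blocks. Passing shuffle = False to train_test_split makes the last 20% of the rows the test period. The "array element with a sequence" error most likely comes from wrapping intervals of different lengths in a single array; scikit-learn estimators expect one 2-D array of shape (n_samples, n_features) per call, so each interval is scored in its own call here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Synthetic stand-in for the weekly data (assumption, not the asker's data)
rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=(1000, 1))   # shape (n_samples, n_features)
target = (data[:, 0] + rng.normal(0, 20, size=1000) > 50).astype(int)

# shuffle=False keeps time order: first 80% is train, last 20% is test
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, shuffle=False)

# Time-ordered folds are used only to pick the regularisation strength
clf = LogisticRegressionCV(cv=TimeSeriesSplit(n_splits=5),
                           random_state=0, max_iter=1000)
clf.fit(X_train, y_train)

# Overall score on the held-out final period
test_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print("AUC on the final 20%:", test_auc)

# Score consecutive intervals of the test period, one 2-D array per call
block_aucs = []
for block in np.array_split(np.arange(len(X_test)), 4):
    probs = clf.predict_proba(X_test[block])[:, 1]
    block_aucs.append(roc_auc_score(y_test[block], probs))
print("AUC per consecutive block:", block_aucs)
```

roc_auc_score needs both classes present in each block, so very short intervals may have to be merged or scored with a different metric.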

I am using: python 3.6.6; numpy 1.18.1; scikit-learn 0.22.1; matplotlib 3.1.3

Thank you very much for your help!

0 answers:

No answers