Python scikit-learn: cross-validation with a MultiIndex

Time: 2018-12-03 10:33:42

Tags: python pandas scikit-learn

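Hi, I want to do cross-validation using one of scikit-learn's functions. What I want is for the fold splits to be determined by one of the index levels. For example, suppose I have the following data, with "Month" and "Day" as the index: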

Month    Day   Feature_1 
January   1      10
          2      20
February  1      30 
          2      40 
March     1      50 
          2      60 
          3      70 
April     1      80 
          2      90 

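Let's say I want 1/4 of the data as the test set for each validation. I want the folds to be separated by the first index level, i.e. the month. In this case the test set would be one month and the remaining 3 months would be the training set. For example, one of the train/test splits would look like this: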

TEST SET:
Month    Day   Feature_1 
January   1      10
          2      20

TRAINING SET:
Month    Day   Feature_1 
February  1      30 
          2      40 
March     1      50 
          2      60 
          3      70 
April     1      80 
          2      90 

How can I do this? Thanks.

2 answers:

Answer 0: (score: 1)

Use:

import numpy as np

# Unique values of the first index level (the months)
indices = df.index.levels[0]

train_indices = np.random.choice(indices,size=int(len(indices)*0.75), replace=False)
test_indices = np.setdiff1d(indices, train_indices)

train = df[np.in1d(df.index.get_level_values(0), train_indices)]
test = df[np.in1d(df.index.get_level_values(0), test_indices)]

Output

Train

              Feature_1
Month    Day           
January  1           10
         2           20
February 1           30
         2           40
March    1           50
         2           60
         3           70

Test

           Feature_1
Month Day           
April 1           80
      2           90

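Explanation

indices = df.index.levels[0] extracts all the unique values of the level=0 index: Index(['April', 'February', 'January', 'March'], dtype='object', name='Month')

train_indices = np.random.choice(indices,size=int(len(indices)*0.75), replace=False) samples 75% of the indices selected in the previous step, without replacement

Next, we take the remaining indices as test_indices

Finally, we split the data into train and test accordingly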
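
The steps above can be put together as a self-contained sketch. The DataFrame is rebuilt from the question's data. The fixed seed, and the use of get_level_values(0).unique() and isin in place of levels[0] and np.in1d, are assumptions made here and not part of the answer. Reading the months from get_level_values avoids picking up level values that are no longer present if df was filtered beforehand.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data with a (Month, Day) MultiIndex
index = pd.MultiIndex.from_tuples(
    [("January", 1), ("January", 2), ("February", 1), ("February", 2),
     ("March", 1), ("March", 2), ("March", 3), ("April", 1), ("April", 2)],
    names=["Month", "Day"],
)
df = pd.DataFrame({"Feature_1": list(range(10, 100, 10))}, index=index)

# Unique months actually present in the index
months = df.index.get_level_values(0).unique().to_numpy()

# Assumption: a fixed seed, only so the split is reproducible
rng = np.random.default_rng(0)
train_months = rng.choice(months, size=int(len(months) * 0.75), replace=False)

# Rows whose month was sampled go to train; the rest go to test
in_train = df.index.get_level_values(0).isin(train_months)
train, test = df[in_train], df[~in_train]
```

This gives a single random split. Repeating it does not guarantee that every month is used as the test set exactly once, which is what a group-aware cross-validator provides.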

Answer 1: (score: 1)

This is called splitting by group. See the user guide in scikit-learn to understand more about it:

  

...

     

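To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.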

     

...

You can use GroupKFold or another strategy with "Group" in its name. An example:

# reset_index() moves the index levels into regular columns,
# so after this "Month" and "Day" are ordinary columns
df = df.reset_index()  

print(df)
Month     Day    Feature_1
January    1           10
January    2           20
February   1           30
February   2           40
March      1           50
March      2           60
March      3           70

groups = df['Month']

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=3)
# X and y are your feature matrix and target, in the same row order as df
for train, test in gkf.split(X, y, groups=groups):
    # "train" and "test" are arrays of positional indices,
    # so use "iloc" to get the actual rows
    print("%s %s" % (train, test))

    print(df.iloc[train, :])
    print(df.iloc[test, :])  
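
As a runnable sketch of the same idea on the question's full data (including April): the group labels can also be read directly from the first index level, so reset_index() is optional. Using n_splits=4 is an assumption made here to match the question's "one month per test set"; with four months and three splits, one test fold would contain two months.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Rebuild the question's data with a (Month, Day) MultiIndex
df = pd.DataFrame({
    "Month": ["January"] * 2 + ["February"] * 2 + ["March"] * 3 + ["April"] * 2,
    "Day": [1, 2, 1, 2, 1, 2, 3, 1, 2],
    "Feature_1": list(range(10, 100, 10)),
}).set_index(["Month", "Day"])

# Group labels taken straight from the first index level
groups = df.index.get_level_values("Month").to_numpy()

# Four months and four splits: each month is the test set exactly once
gkf = GroupKFold(n_splits=4)
test_months = []
for train_pos, test_pos in gkf.split(df, groups=groups):
    train, test = df.iloc[train_pos], df.iloc[test_pos]
    test_months.append(sorted(set(test.index.get_level_values("Month"))))
    print(test)
```

LeaveOneGroupOut from sklearn.model_selection is another "Group" strategy that holds out one group per fold without having to set the number of splits.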

Update: to pass this to a cross-validation function, just pass the month data to its groups parameter, as shown below:

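from sklearn.model_selection import GroupKFold, cross_val_predict
gkf = GroupKFold(n_splits=3)
# groups needs one label per row of X_train, in the same order
y_pred = cross_val_predict(estimator, X_train, y_train, cv=gkf, groups=df['Month'])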
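
A minimal end-to-end sketch of that call follows. The question only has Feature_1, so the "target" column, the choice of LinearRegression as the estimator, and n_splits=4 are all hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

df = pd.DataFrame({
    "Month": ["January"] * 2 + ["February"] * 2 + ["March"] * 3 + ["April"] * 2,
    "Day": [1, 2, 1, 2, 1, 2, 3, 1, 2],
    "Feature_1": list(range(10, 100, 10)),
})

# Hypothetical target (not in the question): an exact linear function of the feature
df["target"] = 2 * df["Feature_1"] + 1

X = df[["Feature_1"]]  # 2-D feature matrix
y = df["target"]

# Each prediction comes from a model that never saw that row's month
gkf = GroupKFold(n_splits=4)
y_pred = cross_val_predict(LinearRegression(), X, y, cv=gkf, groups=df["Month"])
print(y_pred)
```

The groups argument must line up row for row with X and y. If the data is filtered or reordered, the month labels should be taken from the same frame.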