In Python sklearn, how do I retrieve the names of samples/variables in the test/training data?

Time: 2017-05-12 21:06:11

Tags: python pandas scipy scikit-learn

import numpy as np
import pandas as pd
from sklearn import model_selection

# I have imported the dataset with pandas
df = pd.read_csv(filename)

# Preparing data for sklearn
# 1) Drop the name of each sample
df.drop(columns=['id'], inplace=True)
# 2) Separate the features (X) from the classification column (y)
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])

# Split data into testing/training datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.4)

Question: If I want the sample names in the test/training data (after testing), how can I retrieve them?

1 Answer:

Answer 0: (score: 2)

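If you make id the index of df instead of dropping it, the index values are preserved after running train_test_split. First, let's generate some sample data: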

import numpy as np
import pandas as pd

N = 10
ids = ['a','b','c','d','e','f','g','h','i','j']
values = np.random.random(N)
classes = np.random.binomial(n=1,p=.5,size=N)
df = pd.DataFrame({'id':ids,'predictor':values,'label':classes})

Then explicitly set id as the index:

df.set_index('id', inplace=True)
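
As an alternative to calling set_index afterwards, the index can be set while the file is loaded. This is a minimal sketch, assuming the CSV has an id column as in the question; the inline CSV text and the name df_from_csv are made-up stand-ins for illustration:

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file from the question.
csv_text = "id,predictor,label\na,0.21,1\nb,0.47,0\nc,0.30,1\n"

# index_col makes the 'id' column the row index at load time,
# so there is no need to drop it or call set_index later.
df_from_csv = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(df_from_csv.index.tolist())
```

With a real file, the same index_col argument would be passed along with the file path.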

Now df looks like this:

    label  predictor
id                  
a       1   0.214636
b       0   0.466477
c       1   0.300480
d       1   0.378645
e       0   0.755834
f       1   0.506719
g       0   0.948360
h       0   0.736498
i       1   0.058591
j       1   0.997003

Splitting Pandas objects into training/test sets preserves their original index values:

X = df.predictor
y = df.label

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

print(X_train)
id
a    0.214636
b    0.466477
d    0.378645
j    0.997003
i    0.058591
f    0.506719
Name: predictor, dtype: float64
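
Once the split is done, the sample names are simply the index labels of each piece. The following sketch is not part of the original answer: it rebuilds similar example data, and the LogisticRegression model, the fixed labels, and the random_state values are assumptions chosen only to make the example self-contained and reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Example data shaped like the answer's, with 'id' as the index.
rng = np.random.default_rng(0)
ids = list("abcdefghij")
df = pd.DataFrame(
    {"predictor": rng.random(10), "label": [0, 1] * 5},
    index=pd.Index(ids, name="id"),
)

X = df[["predictor"]]  # 2-D DataFrame, as scikit-learn estimators expect
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# The sample names are the index labels of each split.
train_ids = X_train.index.tolist()
test_ids = X_test.index.tolist()

# Attach predictions to the sample names by reusing the test index.
model = LogisticRegression().fit(X_train, y_train)
predictions = pd.Series(model.predict(X_test), index=X_test.index, name="predicted")
print(predictions)
```

If the features have to stay as NumPy arrays, train_test_split also accepts any number of equal-length arrays, so an array of ids can be passed alongside X and y and comes back split in the same way.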