# The dataset is imported with pandas
import numpy as np
import pandas as pd
from sklearn import model_selection
df = pd.read_csv(filename)
#### Preparing data for sklearn
# 1) Drop the column holding the name of each sample
df.drop(['id'], axis=1, inplace=True)
# 2) Separate the features (X, everything except the class column) from the class labels (y)
X = np.array(df.drop(['class'], axis=1))
y = np.array(df['class'])
######
# Split data into training/testing datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.4)
Question: If I want the sample names for the training/testing data (after testing), how can I retrieve them?
Answer 0 (score: 2)
If you set id as the index of df, the index values are preserved after running train_test_split.
First, let's generate some sample data:
import numpy as np
import pandas as pd
N = 10
ids = ['a','b','c','d','e','f','g','h','i','j']
values = np.random.random(N)
classes = np.random.binomial(n=1,p=.5,size=N)
df = pd.DataFrame({'id':ids,'predictor':values,'label':classes})
Then explicitly set id as the index:
df.set_index('id', inplace=True)
Now df looks like this:
label predictor
id
a 1 0.214636
b 0 0.466477
c 1 0.300480
d 1 0.378645
e 0 0.755834
f 1 0.506719
g 0 0.948360
h 0 0.736498
i 1 0.058591
j 1 0.997003
Splitting pandas objects (rather than NumPy arrays) into training and testing sets preserves their original index values:
X = df.predictor
y = df.label
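# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split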
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
print(X_train)
id
a 0.214636
b 0.466477
d 0.378645
j 0.997003
i 0.058591
f 0.506719
Name: predictor, dtype: float64
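To get the names back after testing, you can read them straight off the index of the split objects. Here is a minimal sketch of that idea. The classifier (LogisticRegression), the fixed labels, and the random_state and stratify arguments are my own assumptions to keep the example reproducible; they are not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data in the same shape as above (labels fixed here so both classes are present)
rng = np.random.default_rng(0)
ids = list("abcdefghij")
df = pd.DataFrame({
    "id": ids,
    "predictor": rng.random(len(ids)),
    "label": [0, 1] * 5,
}).set_index("id")

X = df[["predictor"]]   # keep X as a DataFrame so the index travels with it
y = df["label"]

# stratify=y is an assumption: it guarantees both classes end up in the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# The sample names are simply the index of each split
train_ids = X_train.index.tolist()
test_ids = X_test.index.tolist()

# Any estimator would do here; LogisticRegression is just a placeholder
clf = LogisticRegression().fit(X_train, y_train)

# Line up predictions with the sample names of the test set
results = pd.DataFrame(
    {"actual": y_test, "predicted": clf.predict(X_test)},
    index=X_test.index,
)
print(results)
```

If you have to stay with NumPy arrays, train_test_split accepts any number of arrays of the same length, so you could also pass the array of ids as an extra argument and get it split in the same order as X and y.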