How do I get the original indices of the data when using train_test_split()?
What I have is the following:
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
import numpy as np
data = np.reshape(np.random.randn(20), (10, 2))  # 10 training examples
labels = np.random.randint(2, size=10)  # 10 labels
x1, x2, y1, y2 = train_test_split(data, labels, test_size=0.2)
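But this does not give the indices of the original data.
One workaround is to add the indices to the data (e.g. data = [(i, d) for i, d in enumerate(data)]), pass them through train_test_split, and then unpack them again.
Is there a cleaner solution?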
Answer 0 (score: 64)
You can use pandas DataFrames or Series as Julien said, but if you want to restrict yourself to NumPy, you can pass an additional array of indices:
from sklearn.model_selection import train_test_split
import numpy as np
n_samples, n_features, n_classes = 10, 2, 2
data = np.random.randn(n_samples, n_features) # 10 training examples
labels = np.random.randint(n_classes, size=n_samples) # 10 labels
indices = np.arange(n_samples)
x1, x2, y1, y2, idx1, idx2 = train_test_split(
    data, labels, indices, test_size=0.2)
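To check that the returned index arrays really point back into the original data, here is a minimal sketch. The fixed random_state=0 and the assertions are my own additions for illustration, not part of the answer above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples, n_features = 10, 2
data = np.random.randn(n_samples, n_features)
labels = np.random.randint(2, size=n_samples)
indices = np.arange(n_samples)

# random_state is an assumption added here so the split is reproducible.
x1, x2, y1, y2, idx1, idx2 = train_test_split(
    data, labels, indices, test_size=0.2, random_state=0)

# Indexing the original arrays with the returned indices reproduces the splits.
assert np.array_equal(data[idx1], x1)
assert np.array_equal(labels[idx2], y2)

# Together the two index arrays cover every original row exactly once.
assert np.array_equal(np.sort(np.concatenate([idx1, idx2])), indices)
```

Because all arrays passed to train_test_split are split with the same permutation, idx1 and idx2 line up row for row with x1/y1 and x2/y2.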
Answer 1 (score: 27)
Scikit-learn plays really well with pandas, so I suggest you use it. Here's an example:
In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
data = np.reshape(np.random.randn(20),(10,2)) # 10 training examples
labels = np.random.randint(2, size=10) # 10 labels
In [2]:
X = pd.DataFrame(data)
y = pd.Series(labels)
In [3]:
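test_size = 0.3  # assumed value: the original never defines test_size; 0.3 matches the 3 test rows shown below
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=test_size,
                                                    random_state=0)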
In [4]: X_test
Out[4]:
0 1
2 -1.39 -1.86
8 0.48 -0.81
4 -0.10 -1.83
In [5]: y_test
Out[5]:
2 1
8 1
4 1
dtype: int32
You can call any scikit-learn function directly on a DataFrame/Series and it will work.
Say you want to fit a LogisticRegression; here is a nice way to retrieve the coefficients:
In [6]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model = model.fit(X_train, y_train)
# Retrieve coefficients: index is the feature name ([0,1] here)
df_coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns = ['Coefficient'])
df_coefs
Out[6]:
Coefficient
0 0.076987
1 -0.352463
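Answer 2 (score: 1)
The docs mention that train_test_split is just a convenience function on top of ShuffleSplit.
I just rearranged some of its code to make my own example. Note that the actual solution is the middle block of code. The rest is imports and setup for a runnable example.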
# Imports and setup
from sklearn.model_selection import ShuffleSplit
from sklearn.utils import indexable
from itertools import chain
import numpy as np
X = np.reshape(np.random.randn(20), (10, 2))  # 10 training examples
y = np.random.randint(2, size=10)  # 10 labels
seed = 1
# Actual solution
cv = ShuffleSplit(random_state=seed, test_size=0.25)
arrays = indexable(X, y)
train, test = next(cv.split(X=X))
# Plain NumPy indexing (a[train]) stands in for sklearn.utils.safe_indexing,
# which was removed in scikit-learn 0.24.
iterator = list(chain.from_iterable((
    a[train],
    a[test],
    train,
    test
) for a in arrays))
X_train, X_test, train_is, test_is, y_train, y_test, _, _ = iterator
# Inspect the result
print(X)
print(train_is)
print(X_train)
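Now I have the actual indices: train_is, test_is
Answer 3 (score: 1)
Here is the simplest solution (Jibwa made it look complicated in another answer), without having to generate the indices yourself: just use a ShuffleSplit object to generate one split.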
import numpy as np
from sklearn.model_selection import ShuffleSplit # or StratifiedShuffleSplit
sss = ShuffleSplit(n_splits=1, test_size=0.1)
data_size = 100
X = np.reshape(np.random.rand(data_size*2),(data_size,2))
y = np.random.randint(2, size=data_size)
sss.get_n_splits(X, y)
train_index, test_index = next(sss.split(X, y))
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
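The same pattern should carry over to StratifiedShuffleSplit, which the import comment above mentions. A minimal sketch, assuming balanced 0/1 labels and a fixed random_state (both are my own choices for illustration, not part of the answer):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

data_size = 100
X = np.random.rand(data_size, 2)
# Balanced 0/1 labels, chosen so the effect of stratification is easy to check.
y = np.arange(data_size) % 2

# random_state is an assumption added here so the split is reproducible.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_index, test_index = next(sss.split(X, y))

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

# Class counts in the test set; with balanced labels they stay balanced.
print(np.bincount(y_test))
```

Unlike ShuffleSplit, the stratified variant needs y in split() because it uses the labels to keep the class proportions the same in both parts.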
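Answer 4 (score: 0)
If you are using pandas, you can access the indices by calling .index on whichever array you want to mimic. train_test_split carries the pandas index over to the new DataFrames.
In your code you simply use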
x1.index
The returned array holds the indices that refer to the original rows in x.
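A minimal sketch of this, assuming a DataFrame with string labels as its index (the labels, variable names, and random_state are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows labelled 'a'..'j' so labels and positions differ.
x = pd.DataFrame(np.random.randn(10, 2), index=list("abcdefghij"))
y = pd.Series(np.random.randint(2, size=10), index=x.index)

x1, x2, y1, y2 = train_test_split(x, y, test_size=0.2, random_state=0)

# The index of each split holds the labels of the original rows.
print(x2.index)

# Looking those labels up in the original frame gives back the same rows.
assert x.loc[x1.index].equals(x1)

# If integer positions are needed instead of labels, map labels to positions.
positions = x.index.get_indexer(x2.index)
assert x.iloc[positions].equals(x2)
```

With the default RangeIndex the labels and the positions coincide, so x1.index can be used directly as positional indices in that case.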