I am using a pandas DataFrame with train_test_split from sklearn's cross_validation module.
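import numpy as np
import pandas
import sklearn.cross_validation  # binds the name `sklearn` used in the calls below

# 300 samples: feature 'a' is random noise, label 'c' is 100 ones then 200 zeros
d = pandas.DataFrame({'a': np.random.randn(300),
                      'c': np.concatenate([np.ones(100), np.zeros(200)])})
X, y = d['a'], d['c']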
This works:
X_train_and_cv, X_test,y_train_and_cv,y_test = sklearn.cross_validation.train_test_split(X,y,test_size=0.2,random_state=0)
X_train, X_cv,y_train,y_cv = sklearn.cross_validation.train_test_split(X_train_and_cv,y_train_and_cv,test_size=0.2,random_state=0)
Why doesn't this work?
X_train_and_cv, X_test,y_train_and_cv,y_test = sklearn.cross_validation.train_test_split(X,y,test_size=0.2,random_state=0,stratify=y)
X_train, X_cv,y_train,y_cv = sklearn.cross_validation.train_test_split(X_train_and_cv,y_train_and_cv,test_size=0.2,random_state=0,stratify=y)
in _is_valid_list_like(self, key, axis)
1536 l = len(ax)
1537 if len(arr) and (arr.max() >= l or arr.min() < -l):
-> 1538 raise IndexError("positional indexers are out-of-bounds")
1539
1540 return True
IndexError: positional indexers are out-of-bounds
Answer 0 (score: 2)
TL;DR: your second call to train_test_split passes a stratify array whose length differs from that of the y you are splitting. Use stratify=y_train_and_cv.
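First, a small note: the cross_validation module (0.17.1 docs) was deprecated in scikit-learn 0.18 and removed in 0.20; use model_selection.train_test_split (0.18.1) instead. I will import train_test_split itself to keep what follows shorter: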
# Same as this in older versions:
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
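If the same script has to run on both old and new scikit-learn installations, one possible approach (a sketch, not something the original answer prescribes) is to try the new import first and fall back to the old module only when it is missing:

```python
# Prefer the current location; fall back only on very old scikit-learn (< 0.18).
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split

# Quick sanity check on toy data: 10 items, 20% held out.
train_part, test_part = train_test_split(list(range(10)),
                                         test_size=0.2,
                                         random_state=0)
print(len(train_part), len(test_part))  # 8 2
```

Either way, the name train_test_split is then available for the calls that follow.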
This is fine:
X_train_and_cv, X_test,y_train_and_cv,y_test = train_test_split(X,y,
test_size=0.2,
random_state=0,
stratify=y)
This is not fine, because the y being split is y_train_and_cv (len = 240) while stratify=y has len = 300:
X_train, X_cv,y_train,y_cv = train_test_split(X_train_and_cv,
y_train_and_cv,
test_size=0.2,
random_state=0,
stratify=y)
Replace it with:
X_train, X_cv,y_train,y_cv = train_test_split(X_train_and_cv,
y_train_and_cv,
test_size=0.2,
random_state=0,
stratify=y_train_and_cv)
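Putting the pieces together, here is a minimal end-to-end sketch of the two-stage stratified split. It assumes scikit-learn 0.18 or later and rebuilds the question's data with a seeded RandomState (the seed and the `pd` alias are my own choices, not part of the question). The point to notice is that each call stratifies on the labels of the data it is actually splitting.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the question's data: 100 ones followed by 200 zeros as labels.
rng = np.random.RandomState(0)  # seed chosen only for reproducibility
d = pd.DataFrame({'a': rng.randn(300),
                  'c': np.concatenate([np.ones(100), np.zeros(200)])})
X, y = d['a'], d['c']

# First split: stratify on the full label vector (len 300).
X_train_and_cv, X_test, y_train_and_cv, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Second split: stratify on the labels of the subset being split (len 240).
X_train, X_cv, y_train, y_cv = train_test_split(
    X_train_and_cv, y_train_and_cv, test_size=0.2, random_state=0,
    stratify=y_train_and_cv)

print(len(y), len(y_train_and_cv), len(y_test))  # 300 240 60
print(len(y_train), len(y_cv))                   # 192 48

# Share of the positive class in each part; stratification keeps these close.
print(y_train.mean(), y_cv.mean(), y_test.mean())
```

Because stratify must line up one-to-one with the rows being split, a quick `len(stratify_array) == len(X_being_split)` check before each call is a cheap way to catch this mistake early.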