Cross-validation with a specific test size

Time: 2016-02-09 08:01:47

Tags: python scikit-learn

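Previously I used cross_validation.train_test_split to split my dataset 90:10. I am now moving to StratifiedShuffleSplit (which scikit-learn describes as a merge of StratifiedKFold and ShuffleSplit). Is it better to do the stratified split with a specified test size, or should it be done without specifying one?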

This is what I am doing:

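import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

# Read one whitespace-tokenized document per line.
train = []
with open("/Users/minks/Documents/documents.txt") as f:
    for line in f:
        train.append(line.strip().split())
train = np.array(train)

# Read the class labels (whitespace-separated).
labels = []
with open("/Users/minks/Documents/Labels.txt") as t:
    for line in t:
        labels.extend(line.strip().split())
labels = np.array(labels)

# Five stratified shuffles, each holding out 10% of the samples for testing.
kf = StratifiedShuffleSplit(labels, n_iter=5, test_size=0.10)

for train_index, test_index in kf:
    X_train, X_test = train[train_index], train[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]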

I want to know whether specifying a test_size is a good decision for performance, because I assume that if I don't, it will pick a random ratio.

1 Answer:

Answer 0: (score: 0)

If you don't specify your own test size, it defaults to 0.1; it does not use a random ratio. You can find the default in the docstring of the function:

In IPython, do

In [1]: from sklearn.cross_validation import StratifiedShuffleSplit
In [2]: StratifiedShuffleSplit?

and you will see

[...]
Parameters
----------
n : int
    Total number of elements in the dataset.

n_iter : int (default 10)
    Number of re-shuffling & splitting iterations.

test_size : float (default 0.1), int, or None
    If float, should be between 0.0 and 1.0 and represent the
    proportion of the dataset to include in the test split. If
    int, represents the absolute number of test samples. If None,
    the value is automatically set to the complement of the train size.

[...]
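
To make the three forms of test_size in the docstring above concrete, here is a minimal standard-library sketch of how a float, an int, or None could be turned into a number of test samples. The helper name resolve_test_count is hypothetical and not part of scikit-learn, and the rounding (up for a float test_size, down for a float train_size) is an assumption for illustration, not a statement of what the library does internally.

```python
import math


def resolve_test_count(n, test_size=0.1, train_size=None):
    """Hypothetical helper: number of test samples for a dataset of n elements."""
    if test_size is None:
        if train_size is None:
            raise ValueError("test_size and train_size cannot both be None")
        # None: the test set is the complement of the train set.
        if isinstance(train_size, float):
            # Assumption: a float train_size is rounded down.
            return n - math.floor(train_size * n)
        return n - train_size
    if isinstance(test_size, float):
        if not 0.0 < test_size < 1.0:
            raise ValueError("a float test_size must be between 0.0 and 1.0")
        # Float: a proportion of the dataset (assumption: rounded up).
        return math.ceil(test_size * n)
    # Int: an absolute number of test samples.
    return int(test_size)


print(resolve_test_count(1000))                                    # default 0.1
print(resolve_test_count(1000, test_size=7))                       # absolute count
print(resolve_test_count(1000, test_size=None, train_size=0.75))   # complement
```

Either way the ratio is fixed rather than random: leaving test_size out gives the same split sizes as passing 0.1 explicitly. Note that in scikit-learn 0.18 and later the class lives in sklearn.model_selection, takes n_splits instead of n_iter, and receives the labels through split(X, y) rather than the constructor.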