Creating multiple splits of a dataset with a single train_test_split command

Asked: 2012-11-12 15:16:46

Tags: python numpy machine-learning scikit-learn

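  • My dataset has 42000 rows.
  • I need to split the dataset into training, cross-validation and test sets in the proportions 60%, 20% and 20%, as Professor Andrew Ng recommends in his ML class lectures.
  • I realize scikit-learn has a method, train_test_split, for this, but I could not get it to produce a 0.6, 0.2, 0.2 split in a one-liner command.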

What I have so far:

# split data into training, cv and test sets
from sklearn import cross_validation
train, intermediate_set = cross_validation.train_test_split(input_set, train_size=0.6, test_size=0.4)
cv, test = cross_validation.train_test_split(intermediate_set, train_size=0.5, test_size=0.5)


# preparing the training dataset
print 'training shape(Tuple of array dimensions) = ', train.shape
print 'training dimension(Number of array dimensions) = ', train.ndim
print 'cv shape(Tuple of array dimensions) = ', cv.shape
print 'cv dimension(Number of array dimensions) = ', cv.ndim
print 'test shape(Tuple of array dimensions) = ', test.shape
print 'test dimension(Number of array dimensions) = ', test.ndim

which gives me the following result:

training shape(Tuple of array dimensions) =  (25200, 785)
training dimension(Number of array dimensions) =  2
cv shape(Tuple of array dimensions) =  (8400, 785)
cv dimension(Number of array dimensions) =  2
test shape(Tuple of array dimensions) =  (8400, 785)
test dimension(Number of array dimensions) =  2
features shape =  (25200, 784)
labels shape =  (25200,)

How can I do this in a single command?

1 Answer:

Answer 0: (score: 1)

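Read the source code of train_test_split and its companion class ShuffleSplit, and adapt it to your use case. It is not a large function, so adapting it should not be very complicated.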
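
As a rough illustration of what such an adaptation might look like, here is a minimal sketch that shuffles row indices and cuts them at two points. It uses plain NumPy rather than the actual scikit-learn source, and the function name train_cv_test_split, its fractions and seed parameters, and the small stand-in array are all hypothetical names chosen for this example.

```python
import numpy as np


def train_cv_test_split(data, fractions=(0.6, 0.2, 0.2), seed=None):
    """Hypothetical helper: shuffle the rows of `data` and cut them into
    three consecutive pieces (train, cv, test)."""
    data = np.asarray(data)
    if len(fractions) != 3 or not np.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must be three values that sum to 1")

    n = len(data)
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n)  # shuffled row indices 0..n-1

    # Cut points: where the training part ends and where the cv part ends.
    first_cut = int(round(fractions[0] * n))
    second_cut = int(round((fractions[0] + fractions[1]) * n))
    train_idx, cv_idx, test_idx = np.split(shuffled, [first_cut, second_cut])

    return data[train_idx], data[cv_idx], data[test_idx]


# Small stand-in for the real input_set (assumption: rows are samples).
input_set = np.arange(100 * 5).reshape(100, 5)
train, cv, test = train_cv_test_split(input_set, seed=0)
print(train.shape, cv.shape, test.shape)
```

With this helper defined, the split itself becomes a single call. Because the indices are shuffled once and then cut, the three sets are disjoint and together cover every row.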