How do I split a DataFrame or matrix into training and test sets by value rather than by row?

Time: 2015-07-08 17:15:53

Tags: python numpy pandas scikit-learn

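I have a sparse matrix G whose values (the non-NaN entries) need to be split into train and test sets. The train_test_split function in sklearn splits on rows, but I want it to split on the individual entries, that is, on the actual indices. This is roughly what I am trying to do: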

train, test = split non-nan values in G (80/20 train/test)
test_and_nan = combine test and nan sets
G2 = G.copy()
G2[ test_and_nan ] = 0  // initialize to 0 before imputing

do until norm(G2, frobenius) doesn't change much from last iteration
    S, C = nmf(G2)
    // use nmf decomposition to impute test_and_nan values
    G2[ test_and_nan ] = (S*C)[ test_and_nan ]

compute rmse( G[test] - G2[test] )

I would like to use boolean masks to select the indices, but I don't know how to do that. Any help would be appreciated.

1 answer:

Answer 0: (score: 1)

You can split the indices using a random vector with the same size as the data (the number of elements). Something like this:

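import numpy as np

TRAIN_SIZE = 0.80
# Create a boolean mask.
# np.random.rand returns a vector of random values drawn uniformly from [0, 1);
# comparing them against TRAIN_SIZE gives a mask that is True for roughly 80% of positions.
# Note: len(data) is the number of rows when data is 2-D, so this mask then selects whole rows.
msk = np.random.rand(len(data)) < TRAIN_SIZE

train = data[msk]
test = data[~msk]  # inverse of the boolean mask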
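
To split on individual entries rather than on rows, the same idea can be applied with a mask that has the full shape of the matrix. The following is only a minimal sketch under some assumptions: G is taken to be a dense NumPy float array in which NaN marks the missing entries (a scipy.sparse matrix would need different handling), the example G is made-up random data, and the rmse helper is a hypothetical name introduced here for illustration. The split is random per entry, so the ratio is only approximately 80/20. The NMF imputation loop from the question is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical stand-in for G: dense floats with NaN marking missing entries.
G = rng.random((20, 15))
G[rng.random(G.shape) < 0.3] = np.nan

TRAIN_SIZE = 0.80

observed = ~np.isnan(G)                  # True where G holds a real value
draw = rng.random(G.shape) < TRAIN_SIZE  # one random draw per entry, not per row

train_mask = observed & draw             # observed entries kept for training
test_mask = observed & ~draw             # observed entries held out for testing
test_and_nan = ~train_mask               # held-out entries plus the NaN entries

G2 = G.copy()
G2[test_and_nan] = 0                     # initialize to 0 before imputing


def rmse(a, b, mask):
    """Root-mean-square error between a and b over the entries selected by mask."""
    diff = a[mask] - b[mask]
    return np.sqrt(np.mean(diff ** 2))
```

After imputing into G2, the held-out error would be computed as rmse(G, G2, test_mask). Indexing a 2-D array with a 2-D boolean mask of the same shape returns the selected values as a 1-D array, which is why the same masks work for both assignment and evaluation.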