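I have a sparse matrix G whose (non-NaN) values need to be split into test/train sets. The train_test_split function in sklearn splits on rows, but I want to split on the actual element indices. This is roughly what I am trying to do: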
train, test = split non-NaN values in G (80/20 train/test)
test_and_nan = combine test and NaN sets
G2 = G.copy()
G2[ test_and_nan ] = 0 // initialize to 0 before imputing

do until norm(G2, frobenius) doesn't change much from last iteration
    S, C = nmf(G2)
    // use NMF decomposition to impute test_and_nan values
    G2[ test_and_nan ] = (S*C)[ test_and_nan ]

compute rmse( G[test] - G2[test] )
I would like to use a boolean mask to select the indices, but I don't know how to do that. Any help would be greatly appreciated.
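Answer 0 (score: 1)
You can split the indices using a random vector with the same size as your data (its number of elements). Something like this: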
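import numpy as np

TRAIN_SIZE = 0.80
# data is assumed to be a NumPy array (1-D or 2-D)
# Create a boolean mask with the same shape as data:
# np.random.rand draws uniform values in [0, 1), and comparing them with
# TRAIN_SIZE marks roughly 80% of the elements as True
msk = np.random.rand(*data.shape) < TRAIN_SIZE
train = data[msk]   # elements selected for training
test = data[~msk]   # inverse of the boolean mask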
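To connect this to the matrix in the question, here is a minimal sketch of the same masking idea restricted to the non-NaN entries of G. It rests on several assumptions that are not in the original post: G is held as a dense NumPy array with NaN marking missing entries (a scipy.sparse matrix would need converting first), the toy matrix and the fixed seed are made up for illustration, and the names observed, draw, train_mask, test_mask and the rmse helper are hypothetical.

```python
import numpy as np

TRAIN_SIZE = 0.80
rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Toy stand-in for G (assumption): dense array, NaN marks a missing entry
G = rng.random((50, 40))
G[rng.random(G.shape) < 0.3] = np.nan

observed = ~np.isnan(G)                  # True where G holds a real value
draw = rng.random(G.shape) < TRAIN_SIZE  # True with probability 0.8 per entry

train_mask = observed & draw             # observed entries kept for training
test_mask = observed & ~draw             # observed entries held out for testing
test_and_nan = ~train_mask               # held-out entries plus the NaN entries

G2 = G.copy()
G2[test_and_nan] = 0.0                   # initialize to 0 before imputing


def rmse(G, G_hat, mask):
    """Root-mean-square error between G and G_hat on the masked entries only."""
    diff = G[mask] - G_hat[mask]
    return np.sqrt(np.mean(diff ** 2))
```

Because the masks have the same shape as G, they can be reused unchanged inside the imputation loop (for example G2[test_and_nan] = reconstruction[test_and_nan]) and in the final rmse(G, G2, test_mask) call. Note that the split is only approximately 80/20, since each entry is assigned independently.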