Hi, I have a sparse CSR matrix built like this:
import numpy as np
import pandas as pd
from scipy import sparse

userid = list(np.sort(matrix.USERID.unique()))  # Get our unique customers
artid = list(matrix.ARTID.unique())  # Get our unique products that were purchased
click = list(matrix.TOTALCLICK)
# Get the associated row indices
rows = pd.Categorical(matrix.USERID, categories=userid).codes
# Get the associated column indices
cols = pd.Categorical(matrix.ARTID, categories=artid).codes
item_sparse = sparse.csr_matrix((click, (rows, cols)), shape=(len(userid), len(artid)))
The original matrix contains users' interactions with products on a website.
I end up with a sparse matrix in this format:
(0, 4136) 1
(0, 5553) 1
(0, 9089) 1
(0, 24104) 3
(0, 28061) 2
(1, 0) 2
(1, 224) 1
(1, 226) 1
(1, 324) 2
(1, 341) 1
(1, 530) 1
(1, 642) 1
(1, 658) 1
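How can I group the sparse matrix by its first index (the user) and put the first 80% of each user's rows in a training set and the remaining 20% in a test set? I should end up with two matrices.
Train: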
(0, 4136) 1
(0, 5553) 1
(0, 9089) 1
(1, 0) 2
(1, 224) 1
(1, 226) 1
(1, 324) 2
(1, 341) 1
(1, 530) 1
Test:
(0, 24104) 3
(0, 28061) 2
(1, 642) 1
(1, 658) 1
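Answer 0 (score: 1)
You can use the StratifiedShuffleSplit class from scikit-learn (or StratifiedKFold if you do not want to shuffle, but then you need 5 splits to get an 80%/20% train/test split, since there is no other way to control the test size):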
import sklearn.model_selection
import numpy as np
# Array similar to your structure: one (user, item, clicks) triple per row
x = np.asarray([[0,4136,1],[0,5553,1],[0,9089,1],[1,0,2],
                [1,224,1],[1,226,1],[1,324,2],[1,341,1],[1,530,1]])
# Get train and test indices using x[:,0] to define the 'classes'
cv = sklearn.model_selection.StratifiedShuffleSplit(n_splits=1, test_size=0.2)
# Note, X isn't actually used in the method, np.zeros(n_samples) would also work
# Also note that cv.split is an iterator with 1 element (split),
# hence getting the first element of the list
train_idx, test_idx = list(cv.split(X=x, y=x[:,0]))[0]
print("Training")
for i in train_idx:
    print(x[i,:2], x[i,2])
print("Test")
for i in test_idx:
    print(x[i,:2], x[i,2])
I don't have much experience with sparse matrices, so I hope you can make the necessary adjustments from my example.
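For the non-shuffled "first 80% of each user's entries" split the question asks for, the adjustment to a SciPy CSR matrix could look roughly like the sketch below. The helper name split_per_user and the train_frac parameter are made up for illustration, and two assumptions are baked in: "first" means lowest column index within a row, and the per-user train count is rounded down, so a user with 5 entries gets 4 train and 1 test rather than the 3/2 shown in the question.

```python
import numpy as np
from scipy import sparse


def split_per_user(mat, train_frac=0.8):
    """Hypothetical helper: per row, the first train_frac of the stored
    entries (ordered by column index) go to train, the rest to test."""
    csr = sparse.csr_matrix(mat, copy=True)
    csr.sort_indices()  # fix the order of entries within each row
    counts = np.diff(csr.indptr)  # number of stored entries per row
    rows = np.repeat(np.arange(csr.shape[0]), counts)  # row id of each entry
    # Position of each entry inside its own row: 0, 1, 2, ...
    pos = np.arange(csr.nnz) - np.repeat(csr.indptr[:-1], counts)
    # Assumption: round the per-row train count down
    n_train = np.floor(train_frac * counts).astype(int)
    in_train = pos < np.repeat(n_train, counts)

    def build(mask):
        return sparse.csr_matrix(
            (csr.data[mask], (rows[mask], csr.indices[mask])), shape=csr.shape
        )

    return build(in_train), build(~in_train)


# Small made-up matrix: user 0 has 5 interactions, user 1 has 3
demo = sparse.csr_matrix(
    ([1, 1, 1, 3, 2, 2, 1, 1],
     ([0, 0, 0, 0, 0, 1, 1, 1], [1, 2, 3, 5, 7, 0, 2, 4])),
    shape=(2, 8),
)
train, test = split_per_user(demo)
print(train)
print(test)
```

Both outputs keep the shape of the input, so user and item indices stay aligned between the train and test matrices.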
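Answer 1 (score: 0)
Use scikit-learn's train_test_split. You pass it three arguments: the matrix, the split ratio (test_size), and a random state (random_state). The random state is very useful if you want to get the same split again later.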
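One caveat: calling train_test_split directly on a CSR matrix splits whole rows, so entire users would land in either train or test. To split each user's interactions instead, one option is to split the indices of the stored entries and stratify on the row index. This is only a sketch: the function name random_split_entries is made up, the split is random rather than "first 80%", and stratify requires every user to have at least two stored entries.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split


def random_split_entries(mat, test_size=0.2, random_state=0):
    """Hypothetical helper: randomly split stored entries, stratified by row."""
    coo = sparse.coo_matrix(mat)
    idx = np.arange(coo.nnz)  # one index per stored (user, item, clicks) entry
    train_idx, test_idx = train_test_split(
        idx, test_size=test_size, random_state=random_state, stratify=coo.row
    )

    def build(sel):
        return sparse.csr_matrix(
            (coo.data[sel], (coo.row[sel], coo.col[sel])), shape=coo.shape
        )

    return build(train_idx), build(test_idx)


# Made-up data: 2 users with 5 interactions each
demo2 = sparse.csr_matrix(
    (np.arange(1, 11), (np.repeat([0, 1], 5), np.tile(np.arange(5), 2))),
    shape=(2, 5),
)
train2, test2 = random_split_entries(demo2, test_size=0.2, random_state=42)
print(train2.nnz, test2.nnz)
```

Passing the same random_state again reproduces the same split, which is the point the answer makes about the random state.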