How do I split a sparse matrix into training and test sets?

Time: 2019-09-09 20:21:42

Tags: python numpy scikit-learn sparse-matrix

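I want to understand how to work with sparse matrices. I have this code, which generates a multilabel classification dataset as sparse matrices.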

from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(sparse = True, n_labels = 20, return_indicator = 'sparse', allow_unlabeled = False)

This code gives me X in the following format:

<100x20 sparse matrix of type '<class 'numpy.float64'>' 
with 1797 stored elements in Compressed Sparse Row format>

y:

<100x5 sparse matrix of type '<class 'numpy.int64'>'
with 471 stored elements in Compressed Sparse Row format>

Now I need to split X and y into X_train, X_test, y_train, and y_test so that the training set makes up 70% of the data. How do I do that?

This is what I tried:

X_train, X_test, y_train, y_test = train_test_split(X.toarray(), y, stratify=y, test_size=0.3)

and I got this error message:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

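2 answers:

Answer 0 (score: 1)

The error message itself seems to suggest the solution. Both X and y need to be converted to dense matrices.

Do the following: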

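from sklearn.model_selection import train_test_split

# Convert both sparse matrices to dense numpy arrays before splitting
X = X.toarray()
y = y.toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)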

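Answer 1 (score: 0)

The problem is caused by stratify=y. If you look at the documentation for train_test_split, you can see:

*arrays

  • Allowed inputs are lists, numpy arrays, sparse matrices, or pandas dataframes.

stratify

  • array-like (sparse matrices are not mentioned)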
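
Since sparse matrices are allowed for the arrays themselves, one way to see this is to leave stratify out entirely. The following is a minimal sketch, not a recommendation: random_state is added here only for reproducibility and is not part of the question's code, and the resulting split is not stratified.

```python
from scipy import sparse
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split

# Same dataset as in the question; random_state is an assumption added for reproducibility.
X, y = make_multilabel_classification(
    sparse=True, n_labels=20, return_indicator="sparse",
    allow_unlabeled=False, random_state=0,
)

# No stratify argument: the sparse inputs are split row-wise and stay sparse.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)  # 70 training rows, 30 test rows
print(sparse.issparse(X_train), sparse.issparse(y_train))
```

This keeps the 70/30 split the question asks for without any call to toarray(), at the cost of giving up control over how label combinations are distributed between the two sets.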

Unfortunately, even if this dataset is cast to a dense array, it does not work with stratify:

>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y.toarray(), test_size=0.3)
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
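
A likely explanation for this ValueError, as far as I understand the behaviour, is that a 2-D stratify array is handled by treating each distinct row (each combination of labels) as its own class, so any combination that appears only once cannot be placed in both the training and test sets. The sketch below only counts label combinations to check this on the question's dataset; random_state is an assumption added for reproducibility, and the exact counts depend on it.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

# Same dataset as in the question; random_state is an assumption added for reproducibility.
X, y = make_multilabel_classification(
    sparse=True, n_labels=20, return_indicator="sparse",
    allow_unlabeled=False, random_state=0,
)

# Each distinct row of the dense label matrix is one label combination.
combos, counts = np.unique(y.toarray(), axis=0, return_counts=True)

print("distinct label combinations:", len(combos))
print("combinations that occur only once:", int((counts == 1).sum()))
```

If the second number is greater than zero, stratifying on the full label matrix cannot succeed for this data, regardless of whether y is sparse or dense.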