Question

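I have a scenario where I have a data frame and a vocabulary file, and I am trying to fit the vocabulary to a string column of the data frame. I am using scikit-learn's CountVectorizer, which produces a sparse matrix. I need to take the sparse matrix output and merge it back into the data frame, matching each matrix row to its corresponding data frame row.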

Code:

from sklearn.feature_extraction.text import CountVectorizer
docs = ["You can catch more flies with honey than you can with vinegar.",
        "You can lead a horse to water, but you can't make him drink.",
        "search not cleaning up on hard delete",
        "updating firmware version failed",
        "increase not service topology s memory",
        "Nothing Matching Here"
       ]
vocabulary = ["catch more", "lead a horse", "increase service", "updating", "search", "vinegar", "drink", "failed", "not"]

vectorizer = CountVectorizer(analyzer='word', vocabulary=vocabulary, lowercase=True, ngram_range=(1, 19))

SpraseMatrix = vectorizer.fit_transform(docs)

Below is the sparse matrix output:
  (0, 0)    1
  (0, 5)    1
  (1, 6)    1
  (2, 4)    1
  (2, 8)    1
  (3, 3)    1
  (3, 7)    1
  (4, 8)    1

Now, what I want to do is build a string for each row of the sparse matrix and add it to the corresponding document.

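Example: for doc 3 ("updating firmware version failed"), I want to get "3:1 7:1" from the sparse matrix (that is, the column indices of "updating" and "failed" together with their frequencies) and add it to row 3 of the docs data frame.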

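I tried the code below, but it produces flattened output. What I want instead is to get the sub-matrix for each row index, loop over it, build a concatenated string such as "3:1 7:1" for each row, and finally add that string as a new column to the corresponding row of the data frame.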

cx = SpraseMatrix.tocoo()
for i, j, v in zip(cx.row, cx.col, cx.data):
    print((i, j, v))

(0, 0, 1)
(0, 5, 1)
(1, 6, 1)
(2, 4, 1)
(2, 8, 1)
(3, 3, 1)
(3, 7, 1)
(4, 8, 1)

Answer 1

I don't entirely follow what you want, but the lil format may be easier to work with:

In [1121]: from scipy import sparse
In [1122]: M = sparse.coo_matrix(([1,1,1,1,1,1,1,1],([0,0,1,2,2,3,3,4],[0,5,6,4,
      ...: 8,3,7,8])))
In [1123]: M
Out[1123]: 
<5x9 sparse matrix of type '<class 'numpy.int32'>'
    with 8 stored elements in COOrdinate format>
In [1124]: print(M)
  (0, 0)    1
  (0, 5)    1
  (1, 6)    1
  (2, 4)    1
  (2, 8)    1
  (3, 3)    1
  (3, 7)    1
  (4, 8)    1
In [1125]: Ml = M.tolil()
In [1126]: Ml.data
Out[1126]: array([list([1, 1]), list([1]), list([1, 1]), list([1, 1]), list([1])], dtype=object)
In [1127]: Ml.rows
Out[1127]: array([list([0, 5]), list([6]), list([4, 8]), list([3, 7]), list([8])], dtype=object)

Its attributes are organized by row, which looks like what you want.

In [1130]: Ml.rows[3]
Out[1130]: [3, 7]

In [1135]: for i,(rd) in enumerate(zip(Ml.rows, Ml.data)):
      ...:     print(' '.join(['%s:%s'%ij for ij in zip(*rd)]))
      ...:      
0:1 5:1
6:1
4:1 8:1
3:1 7:1
8:1
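To connect this back to the data frame part of the question, here is a minimal sketch that turns the lil rows into strings and stores them in a new column. It assumes the data frame has one row per document in the same order as the matrix rows. The column names "doc" and "features" and the helper row_strings are made up for illustration. The matrix is built with shape (6, 9) because the question has six documents, so the last row is empty.

```python
import pandas as pd
from scipy import sparse

# The six documents from the question; the last one matches nothing.
docs = ["You can catch more flies with honey than you can with vinegar.",
        "You can lead a horse to water, but you can't make him drink.",
        "search not cleaning up on hard delete",
        "updating firmware version failed",
        "increase not service topology s memory",
        "Nothing Matching Here"]

# Same nonzero entries as the printed output, with an explicit 6x9 shape
# so that the unmatched sixth document gets an (empty) row of its own.
rows = [0, 0, 1, 2, 2, 3, 3, 4]
cols = [0, 5, 6, 4, 8, 3, 7, 8]
M = sparse.coo_matrix(([1] * 8, (rows, cols)), shape=(6, 9))
Ml = M.tolil()

def row_strings(lil):
    """Return one 'col:count col:count ...' string per row ('' if empty)."""
    return [" ".join(f"{c}:{v}" for c, v in zip(r, d))
            for r, d in zip(lil.rows, lil.data)]

# Hypothetical data frame: "doc" and "features" are illustrative names.
df = pd.DataFrame({"doc": docs})
df["features"] = row_strings(Ml)
print(df["features"].tolist())
```

A row with no matches simply gets an empty string, so the new column always has the same length as the data frame.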

You can also iterate over the rows of the csr format, but that requires a bit more arithmetic with the .indptr attribute.
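A rough sketch of that csr variant, under the assumption that the same 5x9 matrix is used: for row i, the column indices and values sit in the slice indptr[i]:indptr[i+1] of .indices and .data. The helper name csr_row_strings is invented for this example.

```python
from scipy import sparse

# Same 5x9 matrix as in the session above, converted to CSR.
rows = [0, 0, 1, 2, 2, 3, 3, 4]
cols = [0, 5, 6, 4, 8, 3, 7, 8]
M = sparse.coo_matrix(([1] * 8, (rows, cols)), shape=(5, 9)).tocsr()

def csr_row_strings(csr):
    """Build one 'col:count' string per row by slicing with indptr."""
    csr = csr.sorted_indices()  # copy with column indices sorted in each row
    out = []
    for i in range(csr.shape[0]):
        start, end = csr.indptr[i], csr.indptr[i + 1]
        pairs = zip(csr.indices[start:end], csr.data[start:end])
        out.append(" ".join(f"{c}:{v}" for c, v in pairs))
    return out

print(csr_row_strings(M))
```

This avoids the conversion to lil, which may matter for large matrices since CountVectorizer already returns csr.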

Iterate over a sparse matrix and concatenate the data and indices for each row
