Python - Data structure of csr_matrix

Time: 2017-08-14 15:59:11

Tags: python numpy scipy scikit-learn

I am studying TF-IDF. I used tfidf_vectorizer.fit_transform. It returns a csr_matrix, but I cannot understand the structure of the result.

  • Data input:

documents = ("The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun")

  • Code:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix)
  • Result:

  (0, 9)    0.34399327143
  (0, 7)    0.519713848879
  (0, 4)    0.420753151645
  (0, 0)    0.659191117868
  (1, 9)    0.426858009784
  (1, 4)    0.522108621994
  (1, 8)    0.522108621994
  (1, 1)    0.522108621994
  (2, 9)    0.526261040111
  (2, 7)    0.397544332095
  (2, 4)    0.32184639876
  (2, 8)    0.32184639876
  (2, 1)    0.32184639876
  (2, 3)    0.504234576856
  (3, 9)    0.390963088213
  (3, 8)    0.47820398015
  (3, 1)    0.239101990075
  (3, 10)   0.374599471224
  (3, 2)    0.374599471224
  (3, 5)    0.374599471224
  (3, 6)    0.374599471224

tfidf_matrix is a csr_matrix. So I looked up scipy.sparse.csr_matrix, but nothing there has the same structure as this result.

What kind of structure has values like (0, 9) 0.34399327143?

2 Answers:

Answer 0: (score: 3)

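What you are seeing is just the string representation used when you call print(my_csr_mat). It lists (in your case) all the nonzeros in the matrix. (With a large number of nonzeros, the output may be truncated.)

Because this is a sparse matrix with 2 dimensions, each stored entry is identified by a (row, column) pair.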

(0, 9) 0.34399327143

means: the matrix element at position [0, 9] is 0.34399327143.

A small demo:

import numpy as np
from scipy.sparse import csr_matrix

matrix_dense = np.arange(20).reshape(4,5)
zero_out = np.random.choice((0,1), size=(4,5), p=(0.7, 0.3))
matrix_dense_mod = matrix_dense * zero_out

print(matrix_dense_mod)

sparse_mat = csr_matrix(matrix_dense_mod)

print(sparse_mat)

Output:

[[ 0  0  2  0  4]
 [ 0  6  0  8  0]
 [ 0 11  0 13 14]
 [15  0  0 18 19]]
  (0, 2)        2
  (0, 4)        4
  (1, 1)        6
  (1, 3)        8
  (2, 1)        11
  (2, 3)        13
  (2, 4)        14
  (3, 0)        15
  (3, 3)        18
  (3, 4)        19

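I'm not sure what "So I find on this, but there are no structure as same as the result" means, but note: most examples in the scipy.sparse docs have my_mat.toarray() in the print call, which means they build a dense array from the sparse matrix, and a dense array has a different string representation.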
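To make the difference between the two representations concrete, here is a minimal sketch on a tiny hand-written matrix (the values are made up for illustration, not taken from the question). It prints the same matrix once in sparse form and once via toarray(), and reads one element by [row, col] indexing.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small dense matrix with mostly zeros (illustrative values only).
dense = np.array([[0, 0, 2],
                  [0, 6, 0]])
sparse_mat = csr_matrix(dense)

# Printing the sparse matrix lists one "(row, col) value" line per stored entry.
print(sparse_mat)

# Converting to a dense array prints the familiar 2-D grid instead.
print(sparse_mat.toarray())

# A single element can be read with [row, col] indexing.
value = sparse_mat[0, 2]
print(value)
```

Both prints describe the same matrix; only the string representation differs.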

Answer 1: (score: 2)

Without the vectorizer, I can recreate your matrix, more or less, with this sequence of operations:

In [703]: documents = ( "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun" )

Get a list of words for each document (all lowercase):

In [704]: alist = [l.lower().split() for l in documents]

Get a sorted list of the unique words:

In [705]: aset = set()
In [706]: [aset.update(l) for l in alist]
Out[706]: [None, None, None, None]
In [707]: unq = sorted(list(aset))
In [708]: unq
Out[708]: 
['blue',
 'bright',
 'can',
 'in',
 'is',
 'see',
 'shining',
 'sky',
 'sun',
 'the',
 'we']
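As a side note, the same vocabulary can be built without a list comprehension used only for its side effects. The sketch below is one possible variant, not what the session above ran; the names tokenized, vocabulary and column_of are made up for illustration. The dict gives a constant-time word-to-column lookup, whereas unq.index(c) scans the list on every call.

```python
documents = (
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun the bright sun",
)

# Lowercase and split each document on whitespace.
tokenized = [doc.lower().split() for doc in documents]

# A set comprehension collects the unique words; sorted() fixes the column order.
vocabulary = sorted({word for words in tokenized for word in words})

# Map each word to its column index for fast lookups.
column_of = {word: col for col, word in enumerate(vocabulary)}

print(vocabulary)
print(column_of["the"])
```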

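Go through alist and collect the word counts. rows will hold the sentence number, and cols will hold the index of the word in the unique-word list.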

In [709]: rows, cols, data = [],[],[]
In [710]: for i,row in enumerate(alist):
     ...:     for c in row:
     ...:         rows.append(i)
     ...:         cols.append(unq.index(c))
     ...:         data.append(1)
     ...:         

Make a sparse matrix from this data (sparse here refers to scipy.sparse, e.g. via from scipy import sparse):

In [711]: M = sparse.csr_matrix((data,(rows,cols)))
In [712]: M
Out[712]: 
<4x11 sparse matrix of type '<class 'numpy.int32'>'
    with 21 stored elements in Compressed Sparse Row format>
In [713]: print(M)
  (0, 0)    1
  (0, 4)    1
  (0, 7)    1
  (0, 9)    1
  (1, 1)    1
  ....
  (3, 9)    2
  (3, 10)   1
In [714]: M.A        # viewed as 2d array
Out[714]: 
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
       [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
       [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32)
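The 2s in that array come from a detail worth spelling out: when csr_matrix is built from (data, (rows, cols)), entries that share the same (row, col) position are added together. That is why appending a 1 per word occurrence ends up as a count. A minimal sketch with made-up coordinates:

```python
from scipy.sparse import csr_matrix

# Two entries target the same position (0, 1); one entry targets (1, 0).
rows = [0, 0, 1]
cols = [1, 1, 0]
data = [1, 1, 1]

# Duplicate (row, col) pairs are summed during construction.
counts = csr_matrix((data, (rows, cols)), shape=(2, 2))

print(counts.toarray())
```

Passing shape explicitly is optional here; without it the shape is inferred from the largest row and column index.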

Since this uses sklearn, I can reproduce your matrix with:

In [717]: from sklearn import feature_extraction
In [718]: tf = feature_extraction.text.TfidfVectorizer()
In [719]: tfM = tf.fit_transform(documents)
In [720]: tfM
Out[720]: 
<4x11 sparse matrix of type '<class 'numpy.float64'>'
    with 21 stored elements in Compressed Sparse Row format>
In [721]: print(tfM)
  (0, 9)    0.34399327143
  (0, 7)    0.519713848879
  (0, 4)    0.420753151645
  ....
  (3, 5)    0.374599471224
  (3, 6)    0.374599471224
In [722]: tfM.A
Out[722]: 
array([[ 0.65919112,  0.        ,  0.        ,  0.        ,  0.42075315,
         0.        ,  0.        ,  0.51971385,  0.        ,  0.34399327,
         0.        ],....
       [ 0.        ,  0.23910199,  0.37459947,  0.        ,  0.        ,
         0.37459947,  0.37459947,  0.        ,  0.47820398,  0.39096309,
         0.37459947]])

The actual data is stored in 3 attribute arrays:

In [723]: tfM.indices
Out[723]: 
array([ 9,  7,  4,  0,  9,  4,  8,  1,  9,  7,  4,  8,  1,  3,  9,  8,  1,
       10,  2,  5,  6], dtype=int32)
In [724]: tfM.data
Out[724]: 
array([ 0.34399327,  0.51971385,  0.42075315,  0.65919112,  0.42685801,
       ...
        0.37459947])
In [725]: tfM.indptr
Out[725]: array([ 0,  4,  8, 14, 21], dtype=int32)
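To see how these three arrays encode the printed (row, col) value lines, here is a sketch in plain Python that walks them by hand. The indptr and indices values are copied from the outputs above; the data values are copied from the printed output in the question, since the data array is abbreviated here. Row i owns the slice indptr[i]:indptr[i+1] of both indices and data.

```python
# Copied from tfM.indptr and tfM.indices above.
indptr = [0, 4, 8, 14, 21]
indices = [9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 10, 2, 5, 6]
# Copied from the printed output in the question, in the same order.
data = [
    0.34399327143, 0.519713848879, 0.420753151645, 0.659191117868,
    0.426858009784, 0.522108621994, 0.522108621994, 0.522108621994,
    0.526261040111, 0.397544332095, 0.32184639876, 0.32184639876,
    0.32184639876, 0.504234576856,
    0.390963088213, 0.47820398015, 0.239101990075, 0.374599471224,
    0.374599471224, 0.374599471224, 0.374599471224,
]

triples = []
for row in range(len(indptr) - 1):
    # Entries for this row live at positions start..end-1.
    start, end = indptr[row], indptr[row + 1]
    for k in range(start, end):
        triples.append((row, indices[k], data[k]))

# The first few triples match the first lines of print(tfM).
for row, col, value in triples[:4]:
    print((row, col), value)
```

So the printed form is just this walk over indptr, indices and data, one line per stored element.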

The indices values for each row tell us which words occur in that sentence:

In [726]: np.array(unq)[M[0,].indices]
Out[726]: 
array(['blue', 'is', 'sky', 'the'],
      dtype='<U7')
In [727]: np.array(unq)[M[3,].indices]
Out[727]: 
array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'],
      dtype='<U7')
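The same lookup can attach the tf-idf weights to the words. The sketch below does this for row 0 in plain Python, using the row 0 column indices and values from the printed output. It assumes the vectorizer's column order matches the sorted unq list, which appears to hold here; in a real session the mapping would normally come from the fitted vectorizer itself (recent scikit-learn versions expose get_feature_names_out() for that).

```python
# Sorted unique words, as computed earlier in the session.
unq = ['blue', 'bright', 'can', 'in', 'is', 'see',
       'shining', 'sky', 'sun', 'the', 'we']

# Row 0 of the tf-idf matrix: column indices and values from the printed output.
row0_indices = [9, 7, 4, 0]
row0_data = [0.34399327143, 0.519713848879, 0.420753151645, 0.659191117868]

# Assumption: column k of the tf-idf matrix corresponds to unq[k].
weights = {unq[col]: value for col, value in zip(row0_indices, row0_data)}

# The word with the largest weight in this sentence.
heaviest = max(weights, key=weights.get)

print(weights)
print(heaviest)
```

For "The sky is blue" this puts the largest weight on "blue", the only word of that sentence that appears in no other document, and the smallest on "the", which appears in all four.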