I'm studying TF-IDF. I used tfidf_vectorizer.fit_transform. It returns a csr_matrix, but I can't understand the structure of the result.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ("The sky is blue",
             "The sun is bright",
             "The sun in the sky is bright",
             "We can see the shining sun the bright sun")

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix)
(0, 9)  0.34399327143
(0, 7)  0.519713848879
(0, 4)  0.420753151645
(0, 0)  0.659191117868
(1, 9)  0.426858009784
(1, 4)  0.522108621994
(1, 8)  0.522108621994
(1, 1)  0.522108621994
(2, 9)  0.526261040111
(2, 7)  0.397544332095
(2, 4)  0.32184639876
(2, 8)  0.32184639876
(2, 1)  0.32184639876
(2, 3)  0.504234576856
(3, 9)  0.390963088213
(3, 8)  0.47820398015
(3, 1)  0.239101990075
(3, 10) 0.374599471224
(3, 2)  0.374599471224
(3, 5)  0.374599471224
(3, 6)  0.374599471224
tfidf_matrix is a csr_matrix. So I looked at the scipy.sparse.csr_matrix documentation, but I found no structure there that matches this output.
What is the structure of a value like (0, 9) 0.34399327143?
Answer 0 (score: 3)
What you see is just the string representation produced by calling print(my_csr_mat). It lists (in your case) all of the non-zeros of the matrix. (The output may be truncated when there are a huge number of non-zeros.)
Since this is a sparse matrix, it has 2 dimensions.
(0, 9) 0.34399327143
means: the matrix element at position [0, 9] is 0.34399327143.
Small demo:
import numpy as np
from scipy.sparse import csr_matrix

# dense 4x5 matrix, then randomly zero out roughly 70% of the entries
matrix_dense = np.arange(20).reshape(4, 5)
zero_out = np.random.choice((0, 1), size=(4, 5), p=(0.7, 0.3))
matrix_dense_mod = matrix_dense * zero_out
print(matrix_dense_mod)

# the sparse version stores (and prints) only the non-zeros
sparse_mat = csr_matrix(matrix_dense_mod)
print(sparse_mat)
Output:
[[ 0 0 2 0 4]
[ 0 6 0 8 0]
[ 0 11 0 13 14]
[15 0 0 18 19]]
(0, 2) 2
(0, 4) 4
(1, 1) 6
(1, 3) 8
(2, 1) 11
(2, 3) 13
(2, 4) 14
(3, 0) 15
(3, 3) 18
(3, 4) 19
I'm not sure what you mean by "So I find on this, but there are no structure as same as the result", but note: most examples in the scipy.sparse docs use my_mat.toarray() in the print call, which means a dense array is built from the sparse matrix, and that has a different string-representation style.
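To make the two representations concrete, here is a minimal sketch (assuming scipy and numpy are installed) that builds a small csr_matrix, reads a single element by its [row, col] position, and compares the sparse print style with the dense toarray() style:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Dense 3x4 matrix with a few non-zeros
dense = np.array([[0, 0, 2, 0],
                  [5, 0, 0, 1],
                  [0, 3, 0, 0]])
sparse = csr_matrix(dense)

# Element access uses the same (row, col) positions shown by print(sparse)
print(sparse[1, 0])        # the value stored at position [1, 0], i.e. 5
print(sparse.nnz)          # number of stored non-zeros: 4

# print(sparse) lists the non-zeros as "(row, col)  value" lines;
# toarray() rebuilds the dense array, which prints as a normal 2-d array
print(sparse)
print(sparse.toarray())
```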
Answer 1 (score: 2)
Without the vectorizer, I can more or less reconstruct the matrix with this sequence of operations:
In [703]: documents = ( "The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun" )
Get the list of words (all lowercase):
In [704]: alist = [l.lower().split() for l in documents]
Get the sorted list of unique words:
In [705]: aset = set()
In [706]: [aset.update(l) for l in alist]
Out[706]: [None, None, None, None]
In [707]: unq = sorted(list(aset))
In [708]: unq
Out[708]:
['blue',
'bright',
'can',
'in',
'is',
'see',
'shining',
'sky',
'sun',
'the',
'we']
Iterate through alist and collect the word counts. rows will be the sentence number, cols will be the unique word index:
In [709]: rows, cols, data = [],[],[]
In [710]: for i,row in enumerate(alist):
...: for c in row:
...: rows.append(i)
...: cols.append(unq.index(c))
...: data.append(1)
...:
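One detail worth noting before building the matrix: the loop above appends a separate (row, col, 1) triple every time a word occurs, so a word repeated in a sentence ("the" or "sun" in the later sentences) produces several entries at the same coordinate. The (data, (rows, cols)) form of csr_matrix sums duplicate coordinates, which is what turns the repeated 1s into counts. A minimal sketch of that behavior:

```python
from scipy.sparse import csr_matrix

# Two entries at the same coordinate (0, 1): they are summed, not overwritten
data = [1, 1, 1]
rows = [0, 0, 1]
cols = [1, 1, 2]
m = csr_matrix((data, (rows, cols)), shape=(2, 3))
print(m.toarray())
# [[0 2 0]
#  [0 0 1]]
```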
Make a sparse matrix from this data:
In [711]: M = sparse.csr_matrix((data,(rows,cols)))
In [712]: M
Out[712]:
<4x11 sparse matrix of type '<class 'numpy.int32'>'
with 21 stored elements in Compressed Sparse Row format>
In [713]: print(M)
(0, 0) 1
(0, 4) 1
(0, 7) 1
(0, 9) 1
(1, 1) 1
....
(3, 9) 2
(3, 10) 1
In [714]: M.A # viewed as 2d array
Out[714]:
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
[0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
[0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
[0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32)
Since the question uses sklearn, I can compare that with:
In [717]: from sklearn import feature_extraction
In [718]: tf = feature_extraction.text.TfidfVectorizer()
In [719]: tfM = tf.fit_transform(documents)
In [720]: tfM
Out[720]:
<4x11 sparse matrix of type '<class 'numpy.float64'>'
with 21 stored elements in Compressed Sparse Row format>
In [721]: print(tfM)
(0, 9) 0.34399327143
(0, 7) 0.519713848879
(0, 4) 0.420753151645
....
(3, 5) 0.374599471224
(3, 6) 0.374599471224
In [722]: tfM.A
Out[722]:
array([[ 0.65919112, 0. , 0. , 0. , 0.42075315,
0. , 0. , 0.51971385, 0. , 0.34399327,
0. ],....
[ 0. , 0.23910199, 0.37459947, 0. , 0. ,
0.37459947, 0.37459947, 0. , 0.47820398, 0.39096309,
0.37459947]])
The actual data is stored as 3 attribute arrays:
In [723]: tfM.indices
Out[723]:
array([ 9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1,
10, 2, 5, 6], dtype=int32)
In [724]: tfM.data
Out[724]:
array([ 0.34399327, 0.51971385, 0.42075315, 0.65919112, 0.42685801,
...
0.37459947])
In [725]: tfM.indptr
Out[725]: array([ 0, 4, 8, 14, 21], dtype=int32)
The indices values for each row tell us which words appear in that sentence:
In [726]: np.array(unq)[M[0,].indices]
Out[726]:
array(['blue', 'is', 'sky', 'the'],
dtype='<U7')
In [727]: np.array(unq)[M[3,].indices]
Out[727]:
array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'],
dtype='<U7')
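The same index-to-word lookup is available from the vectorizer itself, so the hand-built unq list isn't needed: TfidfVectorizer stores the word-to-column mapping in its vocabulary_ attribute, which can be inverted into a column-to-word array. A sketch of the lookup (assuming sklearn and numpy are installed; recent sklearn versions also expose this array directly via get_feature_names_out()):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ("The sky is blue",
             "The sun is bright",
             "The sun in the sky is bright",
             "We can see the shining sun the bright sun")

tf = TfidfVectorizer()
tfM = tf.fit_transform(documents)

# vocabulary_ maps word -> column index; sort the words by their index
# to get a column -> word array (it comes out alphabetical)
words = np.array(sorted(tf.vocabulary_, key=tf.vocabulary_.get))
print(words)

# Words appearing in the first sentence, read off the row's indices
print(words[tfM[0].indices])
```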