Question

I wrote the following code to build a bag of words:

from sklearn.feature_extraction.text import CountVectorizer

# data: DataFrame with a 'description' column (definition not shown)
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(data['description'].values.astype('U'))
# On scikit-learn 1.0+ use get_feature_names_out(); get_feature_names() was removed in 1.2
vocab = count_vect.get_feature_names()
print(type(final_counts)) #final_counts is a sparse matrix
print("--------------------------------------------------------------")
print(final_counts.shape)
print("--------------------------------------------------------------")
print(final_counts.toarray())
print("--------------------------------------------------------------")
print(final_counts[769].shape)
print("--------------------------------------------------------------")
print(final_counts[769])
print("--------------------------------------------------------------")
print(final_counts[769].toarray())
print("--------------------------------------------------------------")
print(len(vocab))
print("--------------------------------------------------------------")

I get the following output:

<class 'scipy.sparse.csr.csr_matrix'>
--------------------------------------------------------------
(770, 10252)
--------------------------------------------------------------
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
--------------------------------------------------------------
(1, 10252)
--------------------------------------------------------------
  (0, 4819)   1
  (0, 2758)   1
  (0, 3854)   2
  (0, 3987)   1
  (0, 1188)   1
  (0, 3233)   1
  (0, 981)    1
  (0, 10065)  1
  (0, 9811)   1
  (0, 8932)   1
  (0, 9599)   1
  (0, 10150)  1
  (0, 7716)   1
  (0, 10045)  1
  (0, 5783)   1
  (0, 5500)   1
  (0, 5455)   1
  (0, 3234)   1
  (0, 7107)   1
  (0, 6504)   1
  (0, 3235)   1
  (0, 1625)   1
  (0, 3591)   1
  (0, 6525)   1
  (0, 365)    1
  : :
  (0, 5527)   1
  (0, 9972)   1
  (0, 4526)   3
  (0, 3592)   4
  (0, 10214)  1
  (0, 895)    1
  (0, 10062)  2
  (0, 10210)  1
  (0, 1246)   1
  (0, 9224)   2
  (0, 4924)   1
  (0, 6336)   2
  (0, 9180)   8
  (0, 6366)   2
  (0, 414)    12
  (0, 1307)   1
  (0, 9309)   1
  (0, 9177)   1
  (0, 3166)   1
  (0, 396)    1
  (0, 9303)   7
  (0, 320)    5
  (0, 4782)   2
  (0, 10088)  3
  (0, 4481)   3
--------------------------------------------------------------
[[0 0 0 ... 0 0 0]]
--------------------------------------------------------------
10252
--------------------------------------------------------------

Clearly the corpus contains 770 documents and 10,252 unique words. My confusion is about this line in my code:

print(final_counts[769])

It prints the block of (0, column) entries shown in the output above, starting with (0, 4819) 1 and ending with (0, 4481) 3.

The first index is the document index. I am printing the vector of document 769 (counting from 0), so I would expect the first index to be 769 rather than 0, for example (769, 4819) 1 instead of (0, 4819) 1. Why isn't it?

Answer 1

As explained here, this is because the result is a sparse matrix.

Suppose the vectorizer produces 964 features from 100 documents:

>>> vectorizer = CountVectorizer()
>>> transformed = vectorizer.fit_transform(documents)
>>> transformed
<100x964 sparse matrix of type '<class 'numpy.int64'>'
    with 3831 stored elements in Compressed Sparse Row format>

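If you print the whole matrix, you get the coordinates of the non-zero elements of every document, each in the form

(document index, index of the word in the vocabulary)  count of that word in that document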

>>> print(transformed)
  (0, 30)   1
  (0, 534)  1
  (0, 28)   1
  (0, 232)  2
  (0, 298)  1
  (0, 800)  1
  (0, 126)  1
  : :
  (98, 467) 8
  (98, 461) 63
  (98, 382) 88
  (98, 634) 4
  (98, 15)  1
  (98, 450) 1139
  (99, 441) 1940

For example, print(transformed[(99, 441)]) gives 1940.
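The (row, column) count layout can be reproduced on a much smaller matrix. The following is only a minimal sketch: the 3x4 matrix and the names dense, counts, triples and single are made up for illustration and are not the data from the answer. It assumes scipy and numpy are installed.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document, 4-word count matrix (illustrative data only).
dense = np.array([
    [1, 0, 2, 0],
    [0, 0, 0, 3],
    [0, 5, 0, 1],
])
counts = csr_matrix(dense)

# COO format exposes the stored entries as parallel row/col/data arrays,
# which is the same information print(counts) shows.
coo = counts.tocoo()
triples = [(int(r), int(c), int(v)) for r, c, v in zip(coo.row, coo.col, coo.data)]
for doc_idx, word_idx, count in triples:
    print(f"({doc_idx}, {word_idx})\t{count}")

# Indexing with a (row, column) tuple returns a single count.
single = counts[(2, 1)]
print(single)
```

On the full matrix the row number is the document index, which is why entries such as (98, 467) and (99, 441) appear when the whole matrix is printed.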

Calling print(transformed[0]) gives the following:

  (0, 30)   1
  (0, 534)  1
  (0, 28)   1
  (0, 232)  2
  (0, 298)  1
  (0, 800)  1
  : :
  (0, 683)  12
  (0, 15)   1
  (0, 386)  1
  (0, 255)  1
  (0, 397)  1
  (0, 450)  10
  (0, 682)  2782

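This is because transformed[0] is itself a sparse matrix with a single row, holding 32 non-zero elements. The coordinates that get printed are relative to that one-row matrix, so the row index is always 0 no matter which document was selected; the same holds for final_counts[769].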

>>> transformed[0]
<1x964 sparse matrix of type '<class 'numpy.int64'>'
    with 32 stored elements in Compressed Sparse Row format>

You can index into it with those tuples, e.g. transformed[0][(0, 682)] returns 2782.

(Note that transformed[0].toarray().shape is (1, 964), not (964,).)
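If the original document index is wanted in the printed output, it has to be attached by hand, because the row slice no longer knows where it came from. The sketch below shows one way to do that on a made-up 3x4 matrix; the names counts, doc_idx, row, entries and flat are illustrative assumptions, not part of the original code.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document, 4-word count matrix (illustrative data only).
counts = csr_matrix(np.array([
    [1, 0, 2, 0],
    [0, 0, 0, 3],
    [0, 5, 0, 1],
]))

doc_idx = 2
row = counts[doc_idx]          # a new 1 x 4 sparse matrix, not a 1-D vector

print(row.shape)               # (1, 4)
row_ids = row.tocoo().row      # row numbers inside the slice: all zeros
print(row_ids)

# For a one-row CSR matrix, .indices holds the column numbers and .data
# the counts, so the original document index can be put back manually.
entries = [(doc_idx, int(col), int(val)) for col, val in zip(row.indices, row.data)]
for d, col, val in entries:
    print(f"({d}, {col})\t{val}")

# ravel() turns the (1, 4) dense array into a plain 1-D vector of shape (4,).
flat = row.toarray().ravel()
```

Applied to the question, the same pattern with doc_idx = 769 on final_counts would print entries beginning with 769 instead of 0.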

Text classification + bag of words + Python: bag of words doesn't show the document index
