I have generated tf-idf scores for the words in my corpus and want to find out which words they are. Here is my code and the results:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words = 'english')
X_counts = count_vect.fit_transform(X)
X_counts.shape
Out[4]: (26, 3777)
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)
X_tfidf.shape
Out[73]: (26, 3777)
print(X_tfidf)
(0, 3378) 0.0349567750954
(0, 3018) 0.0349567750954
(0, 3317) 0.0349567750954
(0, 2873) 0.0349567750954
(0, 1678) 0.0310225609857
(0, 2005) 0.0282311916523
(0, 1554) 0.0349567750954
(0, 1855) 0.0349567750954
(0, 709) 0.0260660373875
(0, 3101) 0.0282311916523
(0, 2889) 0.0699135501907
(0, 3483) 0.0193404539445
(0, 3388) 0.0349567750954
(0, 2418) 0.0349567750954
(0, 2962) 0.0310225609857
(0, 1465) 0.0349567750954
(0, 406) 0.0310225609857
(0, 3063) 0.0349567750954
(0, 1070) 0.0260660373875
(0, 1890) 0.0349567750954
(0, 163) 0.0349567750954
(0, 820) 0.0310225609857
(0, 1705) 0.0349567750954
(0, 1985) 0.0215056082093
(0, 760) 0.0349567750954
: :
(25, 711) 0.102364672113
(25, 1512) 0.102364672113
(25, 1674) 0.0701273701419
(25, 2863) 0.102364672113
(25, 765) 0.112486016266
(25, 756) 0.0945139476693
(25, 3537) 0.283541843008
(25, 949) 0.0945139476693
(25, 850) 0.0826760487146
(25, 1289) 0.0945139476693
(25, 3475) 0.127425722423
(25, 186) 0.0738342053646
(25, 3485) 0.0738342053646
(25, 532) 0.0945139476693
(25, 2293) 0.088099438739
(25, 164) 0.0494476278373
(25, 3003) 0.0475454135311
(25, 2994) 0.200322389399
(25, 2993) 0.133548259599
(25, 3559) 0.369171026823
(25, 1474) 0.0738342053646
(25, 3728) 0.102364672113
(25, 923) 0.0826760487146
(25, 1291) 0.0701273701419
(25, 2285) 0.233934283758
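Each line of the sparse printout above is a (document_index, feature_index) pair followed by its tf-idf weight. A minimal toy reproduction of the same pipeline, using a hypothetical two-document corpus in place of X:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# hypothetical toy corpus standing in for X
docs = ["white phosphorus weapon", "chemical weapon banned"]
cv = CountVectorizer()
counts = cv.fit_transform(docs)                   # (2 docs, vocabulary-size columns)
tfidf = TfidfTransformer().fit_transform(counts)
# printing a scipy sparse matrix shows (doc_index, feature_index)  weight
print(tfidf)
print(counts.shape)
```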
What I want to know is the most informative words in each article, the top ten words per article. For example, the words in the first and last article have the following scores:
(0, 760) 0.0349567750954
(25, 3559) 0.369171026823
(25, 2285) 0.233934283758
EDIT:
I tested the code but I get the following error. I also tested it on the X_tfidf vectors and it's the same error.
top_n = 10
for i in range(len(X_counts)):
    print X_tfidf.getrow(i).todense().A1.argsort()[top_n:][::-1]
Traceback (most recent call last):
File "<ipython-input-13-2a181d63441b>", line 2, in <module>
for i in range(len(X_counts)):
File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/scipy/sparse/base.py", line 199, in __len__
raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
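The traceback comes from `len()` not being defined for scipy sparse matrices; iterate over `shape[0]` (the number of rows) instead. A minimal sketch with a small stand-in matrix (hypothetical values):

```python
import numpy as np
from scipy.sparse import csr_matrix

# small sparse matrix standing in for X_tfidf (hypothetical values)
m = csr_matrix(np.array([[0.1, 0.0, 0.3],
                         [0.0, 0.2, 0.0]]))
# len(m) raises TypeError("sparse matrix length is ambiguous...")
for i in range(m.shape[0]):          # use shape[0], the number of rows
    row = m.getrow(i).todense().A1   # .A1 flattens the 1-row matrix
    print(row)
```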
EDITED II:
OK, I changed things and it runs now. However, it produces the vectors but not the top-scoring words.
top_n = 10
for i in range(26):
    print tfidf.getrow(i).todense().A1.argsort()[top_n:][::-1]
[ 681 2501 3693 ..., 2451 2450 2449]
[ 552 1532 1566 ..., 2452 2451 2450]
[2285 3602 742 ..., 2455 2466 2465]
[1266 1074 1662 ..., 2481 2493 2491]
[ 397 2545 2815 ..., 2418 2417 2416]
[3559 1746 482 ..., 2456 2455 2454]
[ 562 2104 1854 ..., 2466 2477 2476]
[1158 3668 983 ..., 2470 2482 2481]
[2070 704 3418 ..., 2452 2451 2450]
[3350 515 376 ..., 2487 2500 2499]
[2266 734 735 ..., 2461 2474 2472]
[ 756 1499 60 ..., 2479 2490 2489]
[3559 3537 550 ..., 2509 2508 2507]
[3559 2882 1720 ..., 2455 2466 2465]
[3404 3199 1617 ..., 2477 2488 2487]
[1415 63 65 ..., 2474 2485 2484]
[2373 3017 441 ..., 2499 2498 2497]
[ 733 2994 516 ..., 2508 2507 2506]
[3615 2200 2387 ..., 2511 2510 2509]
[3559 2558 1289 ..., 2455 2466 2465]
[ 239 1685 2993 ..., 2485 2496 2495]
[1897 2227 357 ..., 2503 2502 2501]
[ 491 1512 3008 ..., 2506 2505 2504]
[2994 675 3125 ..., 2480 2491 2490]
[ 612 1466 2926 ..., 2424 2423 2422]
[2059 3329 3051 ..., 2479 2490 2489]
EDITED III:
The last line gives this error:
Traceback (most recent call last):
File "<ipython-input-12-813e5387f3b7>", line 9, in <module>
print X_counts.get_feature_names()[wordindexes]
File "/home/fledrmaus/anaconda2/lib/python2.7/site-packages/scipy/sparse/base.py", line 525, in __getattr__
raise AttributeError(attr + " not found")
AttributeError: get_feature_names not found
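The AttributeError is because `get_feature_names` is a method of the vectorizer object (`count_vect`), not of the sparse matrix it returns (`X_counts`). A minimal sketch with a hypothetical two-document corpus (note that newer scikit-learn versions rename the method `get_feature_names_out`):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["white phosphorus weapon", "chemical weapon"]  # hypothetical
cv = CountVectorizer()
X_c = cv.fit_transform(docs)
# the vocabulary lives on the vectorizer, not on the sparse matrix
names = (cv.get_feature_names_out() if hasattr(cv, "get_feature_names_out")
         else cv.get_feature_names())
print(sorted(names))
```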
I tried this with TfidfVectorizer this morning and got the same error.
EDITED IV:
print(X_counts)
(0, 2175) 2
(0, 481) 1
(0, 2511) 1
(0, 1167) 1
(0, 3711) 9
(0, 2501) 10
(0, 3298) 1
(0, 2263) 1
(0, 2313) 1
(0, 2939) 1
(0, 1382) 8
(0, 2040) 3
(0, 3542) 1
(0, 715) 1
(0, 2374) 1
(0, 2375) 1
(0, 1643) 3
(0, 1303) 2
(0, 3599) 8
(0, 708) 6
(0, 709) 1
(0, 1128) 1
(0, 559) 1
(0, 1901) 1
(0, 2310) 1
: :
(25, 2755) 1
(25, 1380) 1
(25, 680) 1
(25, 1079) 1
(25, 890) 1
(25, 658) 1
(25, 1363) 1
(25, 337) 1
(25, 3661) 1
(25, 1035) 1
(25, 2952) 1
(25, 94) 1
(25, 1906) 1
(25, 2133) 1
(25, 374) 1
(25, 2099) 1
(25, 2736) 1
(25, 2089) 1
(25, 3163) 1
(25, 3680) 1
(25, 3040) 1
(25, 3157) 1
(25, 1080) 1
(25, 555) 1
(25, 2016) 1
I tested the code again, and again I get the vectors but no words:
[ 681 2501 3693 3694 1382 3711 2141 3599 3598 1741]
[ 552 1532 1566 690 1898 3503 2730 2993 1189 1420]
[2285 3602 742 3708 3264 3668 1511 2211 3579 1291]
[1266 1074 1662 2827 3524 3069 3070 3218 1365 805]
[ 397 2545 2815 1962 213 432 2241 653 426 2117]
EDITED V:
It produces another error:
[ 681 2501 3693 3711 1382 3694 3599 2141 3598 1741]
[1532 552 1566 690 1898 3503 2730 2993 1189 1420]
[2285 3602 742 3708 3264 3668 2211 1511 1292 3579]
[1266 1074 1662 2827 3070 3524 3069 3218 1365 805]
[ 397 2545 2815 1962 213 432 2241 653 426 2117]
print count_vect.get_feature_names()[wordindexes]
Traceback (most recent call last):
File "<ipython-input-16-95b994e8246b>", line 1, in <module>
print count_vect.get_feature_names()[wordindexes]
TypeError: only integer arrays with one element can be converted to an index
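That TypeError happens because `get_feature_names()` returns a plain Python list, and a list cannot be indexed with a NumPy array of several indexes at once. Either loop over the indexes or convert the list to an array first; a minimal sketch with a hypothetical vocabulary:

```python
import numpy as np

# stand-in for count_vect.get_feature_names() (hypothetical vocabulary)
words = ["chemical", "phosphorus", "weapon", "white"]
wordindexes = np.array([3, 0, 2])

# words[wordindexes] raises TypeError; either loop ...
picked = [words[i] for i in wordindexes]
# ... or fancy-index a NumPy array of the words
picked_arr = np.array(words)[wordindexes]
print(picked)              # ['white', 'chemical', 'weapon']
print(list(picked_arr))
```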
EDITED VI:
It looks like this works for one vector/article, but not for five or more. Here are the results:
wordfeatures = count_vect.get_feature_names()
for i in wordindexes:
    print wordfeatures[i]
chemical
phosphorus
weapon
white
falluja
weapons
used
marines
use
illegal
Answer (score: 0):
I assume your X_counts is a document-term matrix, where each column is a word and each row is a document.
So wordindexes will give you a list of word indexes in the order in which the words appear as columns of X_counts. For example, 3 means the 4th column (0, 1, 2, 3).
The following code prints the indexes of the top 10 words in X_counts for all documents in X_tfidf.
top_n = 10
# use X_counts.shape[0] to cover all docs, or a smaller number for the first few
ndocs = X_counts.shape[0]
# get_feature_names() belongs to the CountVectorizer object, not to the matrix
wordfeatures = count_vect.get_feature_names()
for i in range(ndocs):
    # indexes of the top_n highest-scoring features for document i
    wordindexes = X_tfidf.getrow(i).todense().A1.argsort()[-top_n:][::-1]
    print(wordindexes)
    # wordfeatures is a plain list, so it cannot be fancy-indexed with an
    # array; look the words up one index at a time
    for j in wordindexes:
        print(wordfeatures[j])
    print("-----------------------next doc")
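As a side note, `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer` in a single step, and the top-word lookup works the same way. A minimal sketch on a hypothetical corpus (the `get_feature_names_out` fallback covers newer scikit-learn versions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["white phosphorus weapon", "chemical weapon banned"]  # hypothetical
vec = TfidfVectorizer()
X_t = vec.fit_transform(docs)
names = np.array(vec.get_feature_names_out() if hasattr(vec, "get_feature_names_out")
                 else vec.get_feature_names())
for i in range(X_t.shape[0]):
    # indexes of the 2 highest-weighted features for document i
    top = X_t.getrow(i).toarray().ravel().argsort()[-2:][::-1]
    print(names[top])
```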