Word frequency with sklearn's TfidfVectorizer?

Posted: 2016-03-02 20:36:42

Tags: python scikit-learn tf-idf

I have a question about sklearn's TfidfVectorizer and how it computes the word frequencies in each document.

The example code I am looking at is:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
>>>     'The dog ate a sandwich and I ate a sandwich',
>>>     'The wizard transfigured a sandwich'
>>> ]
>>> vectorizer = TfidfVectorizer(stop_words='english')
>>> print vectorizer.fit_transform(corpus).todense()
[[ 0.75458397  0.37729199  0.53689271  0.          0.        ]
 [ 0.          0.          0.44943642  0.6316672   0.6316672 ]]

My question is: how do I interpret the numbers in this matrix? I understand that 0 means a word, e.g. "wizard", occurs 0 times in the first document, so it is 0. But how do I interpret the number 0.75458397? Is it the frequency of the word "ate" in the first document, or the frequency of "ate" across the whole corpus?

5 Answers:

Answer 0 (score: 1)

TF-IDF (which stands for "term frequency - inverse document frequency") gives you a weighted score for each term in its representation, not a raw frequency.

TF-IDF gives a high score to terms that occur in only very few documents, and a low score to terms that occur in many documents, so it is roughly a measure of how discriminative a term is in a given document. See this resource for an excellent description of TF-IDF and a better idea of what it does.

If you just want raw counts, you need to use CountVectorizer instead.
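
For example, a minimal sketch (using the question's corpus; the variable names are just illustrative) of how CountVectorizer returns plain per-document counts instead of weighted scores:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]

# CountVectorizer produces raw term counts per document instead of TF-IDF weights
count_vectorizer = CountVectorizer(stop_words='english')
counts = count_vectorizer.fit_transform(corpus)

# Columns in vocabulary order: ate, dog, sandwich, transfigured, wizard
print(sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get))
print(counts.todense())
# [[2 1 2 0 0]
#  [0 0 1 1 1]]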

Answer 1 (score: 0)

I think you are forgetting that TF-IDF vectors are typically normalized, so that their magnitude (length, or 2-norm) is always 1.

So the TF-IDF value of 0.75 is the frequency of "ate" multiplied by the inverse document frequency of "ate", then divided by the magnitude of that TF-IDF vector.

Here are all the dirty details (skip down to tfidf0 = for the punch line):

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["The dog ate a sandwich and I ate a sandwich",
          "The wizard transfigured a sandwich"]
vectorizer = TfidfVectorizer(stop_words='english')
tfidfs = vectorizer.fit_transform(corpus)


from collections import Counter
import numpy as np
import pandas as pd

columns = [k for (v, k) in sorted((v, k)
           for k, v in vectorizer.vocabulary_.items())]
tfidfs = pd.DataFrame(tfidfs.todense(),
                      columns=columns)
#     ate   dog  sandwich  transfigured  wizard 
#0   0.75  0.38      0.54          0.00    0.00
#1   0.00  0.00      0.45          0.63    0.63

df = (1 / pd.DataFrame([vectorizer.idf_], columns=columns))
#     ate   dog  sandwich  transfigured  wizard
#0   0.71  0.71       1.0          0.71    0.71
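# Note: with sklearn's default smooth_idf=True, idf_ = ln((1 + n_docs) / (1 + doc_freq)) + 1,
# so this "df" is really 1/idf (hence 1.0 for "sandwich", which appears in both documents),
# not a literal document frequency.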
corp = [txt.lower().split() for txt in corpus]
corp = [[w for w in d if w in vectorizer.vocabulary_] for d in corp]
tfs = pd.DataFrame([Counter(d) for d in corp]).fillna(0).astype(int)
#    ate  dog  sandwich  transfigured  wizard
#0    2    1         2             0       0
#1    0    0         1             1       1

# The first document's TFIDF vector:
tfidf0 = tfs.iloc[0] * (1. / df)
tfidf0 = tfidf0 / np.linalg.norm(tfidf0)
#        ate       dog  sandwich  transfigured  wizard
#0  0.754584  0.377292  0.536893           0.0     0.0

tfidf1 = tfs.iloc[1] * (1. / df)
tfidf1 = tfidf1 / np.linalg.norm(tfidf1)
#    ate  dog  sandwich  transfigured    wizard
#0   0.0  0.0  0.449436      0.631667  0.631667
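
A quick sanity check (added here, not part of the original answer, and assuming the tfidf0, tfidf1, tfidfs, and columns names from the snippet above): the manually computed rows should match the vectorizer's output up to floating-point error.

# Reindex by `columns` so the comparison does not depend on column ordering
print(np.allclose(tfidf0[columns].values, tfidfs[columns].iloc[[0]].values))  # True
print(np.allclose(tfidf1[columns].values, tfidfs[columns].iloc[[1]].values))  # True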

Answer 2 (score: 0)

Just print it with the code below, and you will see output something like this:

print(vectorizer.fit_transform(corpus))
# Python 3 syntax; in Python 2, drop the parentheses from print

# (0, 1)        0.448320873199    Document 1, term = Dog
# (0, 3)        0.630099344518    Document 1, term = Sandwich
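
Each line of that sparse printout is (document_index, feature_index) followed by the TF-IDF value. If you want to map the feature indices back to words, one approach (a small sketch; index_to_word is just an illustrative name) is to invert the fitted vocabulary:

X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each word to its column index; invert it to go from index back to word
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

coo = X.tocoo()  # COO format exposes the (row, col, value) triples directly
for doc_idx, feat_idx, value in zip(coo.row, coo.col, coo.data):
    print(doc_idx, index_to_word[feat_idx], value)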

Answer 3 (score: 0)

Note: use this only if your tokens are unigrams.

sklearn's TfidfVectorizer will not give you the counts directly. To get the counts, you can use the TfidfVectorizer class methods inverse_transform and build_tokenizer.

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]

vectorizer = TfidfVectorizer(stop_words='english')

X = vectorizer.fit_transform(corpus)
X_words = tfidf.inverse_transform(X) ## this will give you words instead of tfidf where tfidf > 0

tokenizer = vectorizer.build_tokenizer() ## return tokenizer function used in tfidfvectorizer

for idx,words in enumerate(X_words):
    for word in words:
        count = tokenizer(corpus[idx]).count(word)
        print(idx,word,count)

Output:

0 dog 1
0 ate 2
0 sandwich 2
1 sandwich 1
1 wizard 1
1 transfigured 1
#0 means first sentence in corpus 

This is a workaround; hope it helps someone :)

Answer 4 (score: 0)

That line should be X_words = vectorizer.inverse_transform(X), not tfidf.inverse_transform(X).