Question

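I'm trying to use NLTK to perform term frequency (TF) and inverse document frequency (IDF) analysis on a batch of files (they happen to be corporate press releases from IBM). I know that the claim that NLTK has TF-IDF capabilities has been disputed on SO beforehand, but I've found docs indicating that the module does have them: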

http://www.nltk.org/_modules/nltk/text.html

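I've never seen or used "self" or __init__ in code before. This is what I have so far. Any advice on how to amend this code so that it works is greatly appreciated. What I have currently doesn't return anything. I don't really understand what "source", "self", "term" and "text" represent in the NLTK documentation.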

import nltk.corpus
from nltk.text import TextCollection
from nltk.corpus import gutenberg
gutenberg.fileids()

ibm1 = gutenberg.words('ibm-github.txt')
ibm2 = gutenberg.words('ibm-alior.txt')

mytexts = TextCollection([ibm1, ibm2])
term = 'software'

def __init__(self, source):
    if hasattr(source, 'words'):
        source = [source.words(f) for f in source.fileids()]

    self._texts = source
    Text.__init__(self, LazyConcatenation(source))
    self._idf_cache = {}

def tf(self, term, mytexts):
    result = mytexts.count(term) / len(mytexts)
    print(result)

Answer 1

from nltk.text import TextCollection
from nltk.book import text1, text2, text3

mytexts = TextCollection([text1, text2, text3])

# Print the IDF of a word
print(mytexts.idf("Moby"))

# tf_idf
print(mytexts.tf_idf("Moby", text1))
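
To show what "source", "self", "term" and "text" stand for, here is a minimal pure-Python sketch. The class name TinyCollection and the two sample sentences are made up for illustration. As far as I understand it, the formulas mirror what TextCollection computes (count divided by length for TF, natural log of the ratio of texts for IDF), but this is not NLTK's own code.

```python
import math


class TinyCollection:
    """Hypothetical stand-in for nltk.text.TextCollection (illustration only)."""

    def __init__(self, source):
        # `source` is the list of texts; each text is a list of word tokens.
        # `self` is the object being created; Python passes it automatically.
        self._texts = source

    def tf(self, term, text):
        # Term frequency: the share of tokens in `text` that equal `term`.
        return text.count(term) / len(text)

    def idf(self, term):
        # Inverse document frequency:
        # log(number of texts / number of texts containing `term`).
        matches = sum(1 for text in self._texts if term in text)
        return math.log(len(self._texts) / matches) if matches else 0.0

    def tf_idf(self, term, text):
        # TF-IDF is simply the product of the two values above.
        return self.tf(term, text) * self.idf(term)


# Two tiny made-up "press releases", already split into tokens.
doc1 = "ibm releases new software for developers".split()
doc2 = "ibm partners with a bank".split()
collection = TinyCollection([doc1, doc2])

tf_value = collection.tf("software", doc1)
idf_value = collection.idf("software")
tfidf_value = collection.tf_idf("software", doc1)
print(tf_value, idf_value, tfidf_value)
```

Note that __init__ and tf are never redefined by the caller and "self" is never passed by hand: you build one collection object and call its methods, which is also how TextCollection is meant to be used. A term that occurs in every text (such as "ibm" here) gets an IDF of 0.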

Finding term frequency and inverse document frequency using NLTK (Python 3.5)
