Question

I am using NLTK to compute tf_idf for words, but most of them get a score of 0.

from nltk.text import TextCollection

# detect_lang and AnalyseFrenchText are helpers defined elsewhere (not shown).
def compute_tf_idf(corpus, source_text):
    texts = []
    for text in corpus:
        if text['text'] is not None:
            # Reset so the previous document's tokens are not appended again
            tokenized_text = None
            try:
                language = detect_lang(text['text'])
            except Exception:
                language = None
            # French analysis
            if language == "french":
                french_analyser = AnalyseFrenchText(text['text'])
                french_analyser.analysetext()
                tokenized_text = french_analyser.get_tokenized_text()
            if tokenized_text is not None:
                texts.append(tokenized_text)
    textCorpus = TextCollection(texts)
    for word in textCorpus[:100]:
        print(word)  # prints the words correctly
    # Stays empty if the source text is not detected as French
    tokenized_source_text = []
    try:
        language = detect_lang(source_text)
    except Exception:
        language = None
    # French analysis
    if language == "french":
        french_analyser = AnalyseFrenchText(source_text)
        french_analyser.analysetext()
        tokenized_source_text = french_analyser.get_tokenized_text()
    for word in tokenized_source_text:
        print(word)
        print("idf :" + str(textCorpus.idf(word)))
        print("tf : " + str(textCorpus.tf(word, tokenized_source_text)))
        print("tf_idf :" + str(textCorpus.tf_idf(word, tokenized_source_text)))
    return

Output:

Commande
idf :0.0
tf : 0.0024875621890547263
tf_idf :0.0

I checked the NLTK source code to see how idf is computed:

 """ The number of texts in the corpus divided by the
    number of texts that the term appears in.
    If a term does not appear in the corpus, 0.0 is returned. """

Am I using NLTK's tf_idf incorrectly? Thanks.

Answer 1

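You are using nltk's implementation of the TF-IDF calculation, so I'm not sure what you mean by "what should I change to get the best tf_idf score". What you can change is to stop guessing: find out what the contents of your TextCollection look like, whether it thinks "succursales" is in there, and so on.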

You can check whether a word is in the TextCollection (True or False) like this:

print("succursales" in mytexts)

To see what is actually in mytexts, you can iterate over it like this:

for word in mytexts[:100]:
    print(word)

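My guess is that you will see single letters. The TextCollection constructor expects a list of tokens (words), and it looks like that is not what you are giving it.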
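To illustrate that guess: if a raw string ends up where a list of tokens is expected, iterating over it yields single characters, and a whole-word lookup fails. A minimal sketch in plain Python, where the sample sentence is made up and str.split() merely stands in for your real tokenizer:

```python
# Hypothetical sample sentence; str.split() stands in for a real tokenizer.
raw = "la banque ouvre des succursales"

as_characters = list(raw)  # iterating a str yields one character at a time
as_tokens = raw.split()    # a list of word tokens

print(as_characters[:5])               # ['l', 'a', ' ', 'b', 'a']
print(as_tokens)                       # the five words of the sentence
print("succursales" in as_characters)  # False: only single characters in there
print("succursales" in as_tokens)      # True
```

If your first 100 items look like the first print, the tokenizing step is where to look.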

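Also, tf() needs to be given a list of tokens, and that list should be one document from the corpus, not the whole corpus. But you are passing it some kind of corpus object. In other words, read the documentation so that you understand what these functions are for and how to call them.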
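Here is a small sketch of how the calls fit together, assuming a made-up three-document corpus that is already tokenized (the documents and the word "inconnu" are invented for illustration). As far as I can tell from the NLTK source, idf() returns the logarithm of the ratio described in the docstring, so a term that occurs in every document scores 0.0 just like a term that occurs in none:

```python
from math import log
from nltk.text import TextCollection

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["la", "banque", "ouvre", "des", "succursales"],
    ["la", "banque", "ferme"],
    ["la", "cliente", "attend"],
]
collection = TextCollection(docs)
doc = docs[0]  # tf() and tf_idf() take ONE document, not the collection

# (tf, idf, tf_idf) for a rare word, a word present in every document,
# and a word that is absent from the corpus.
scores = {
    term: (collection.tf(term, doc), collection.idf(term), collection.tf_idf(term, doc))
    for term in ("succursales", "la", "inconnu")
}
for term, (tf, idf, tf_idf) in scores.items():
    print(term, "tf:", tf, "idf:", idf, "tf_idf:", tf_idf)
```

Only "succursales" should come out with a non-zero tf_idf here. If most of your real words score 0.0, it is worth checking whether they occur in every document of the collection or in none of them.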

Computing term frequency-inverse document frequency with NLTK
