Question

I have a text file named "all.txt" that contains an ordinary English paragraph.

For some reason, when I run this code:

    import nltk
    from nltk.collocations import *
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    # change this to read in your data
    finder = BigramCollocationFinder.from_words(('all.txt'))

    # only bigrams that appear 3+ times
    #finder.apply_freq_filter(3)

    # return the 10 n-grams with the highest PMI
    print finder.nbest(bigram_measures.pmi, 10)

I get the following result:

       [('.', 't'), ('a', 'l'), ('l', '.'), ('t', 'x'), ('x', 't')]

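What am I doing wrong, given that I only get letters back? I am looking for words, not letters!

Here is a sample of what is in "all.txt", so you can see what is being processed: "It is not just Democrats who oppose this plan. Americans across the country have voiced their opposition to it. My Democratic colleagues and I have a better plan that would strengthen ethics rules to improve congressional accountability and ensure that legislation is properly considered. The Republican plan fails to close a loophole that allows legislation to be considered before members have read it."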

Answer 1

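The first problem is that you never actually read the file; you are just passing a string containing the file path to the function. Because a string is a sequence of characters, the finder pairs up the characters of "all.txt". The second problem is that you need to run the text through a tokenizer first. To solve the second problem: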

    from nltk.tokenize import word_tokenize

    # tokenize first, so the finder receives a list of words rather than a raw string
    finder = BigramCollocationFinder.from_words(word_tokenize("This is a test sentence"))
    print(finder.nbest(bigram_measures.pmi, 10))

This yields:

    [('This', 'is'), ('a', 'test'), ('is', 'a'), ('test', 'sentence')]

Note that you may want to use a different tokenizer; the documentation for the tokenize package explains the various options in detail.
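
To make the effect of the tokenizer choice concrete, here is a minimal sketch that uses only Python's re module as a stand-in. It is not one of NLTK's tokenizers, and the function names whitespace_tokens and word_only_tokens are made up for illustration. It only shows how two simple strategies treat punctuation differently, which in turn changes the bigrams you would get.

```python
import re

def whitespace_tokens(text):
    # Hypothetical helper: split on runs of whitespace.
    # Punctuation stays attached to the neighbouring word.
    return text.split()

def word_only_tokens(text):
    # Hypothetical helper: keep runs of letters, digits, and underscores.
    # Punctuation is dropped entirely.
    return re.findall(r"\w+", text)

sample = "This is a test sentence. It is short."
print(whitespace_tokens(sample))
print(word_only_tokens(sample))
```

With the first function, "sentence." and "short." keep their trailing periods, so they count as different tokens from "sentence" and "short"; with the second, the periods disappear.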

For the first problem, you can use something like:

    # read the file contents, then tokenize them before building the finder
    with open('all.txt', 'r') as data_file:
        finder = BigramCollocationFinder.from_words(word_tokenize(data_file.read()))
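
As an aside on why the original call returned letters, the following sketch may help. It does not use NLTK; adjacent_pairs is a hypothetical helper that only mimics the way a bigram finder walks over whatever sequence it is given. A Python string is itself a sequence of characters, so passing the file name produces character pairs such as ('a', 'l'), while passing a list of words produces word pairs.

```python
def adjacent_pairs(items):
    # Hypothetical helper: pair each element with its successor,
    # e.g. [a, b, c] -> [(a, b), (b, c)].
    items = list(items)
    return list(zip(items, items[1:]))

# A string is iterated character by character.
print(adjacent_pairs('all.txt'))

# A list of words is iterated word by word.
print(adjacent_pairs('not just the Democrats'.split()))
```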

Trouble with NLTK bigram finder
