I have a problem that should actually be easy to solve with NLTK. I found solutions to the problem, but they don't use NLTK:
how can I count the specific bigram words?
Is it possible to do this with an NLTK function?
Here is my code:
import io
import nltk
from nltk.corpus import PlaintextCorpusReader

food = open("food_low.txt")
lines = food.read().split(',')[:-1]

# Assumption: wordlists was not defined in the original snippet;
# a PlaintextCorpusReader over the current directory is one way to get it
wordlists = PlaintextCorpusReader('.', ['cleaned2.txt'])
raw = wordlists.words("cleaned2.txt")
fdist = nltk.FreqDist(w.lower() for w in raw)

with io.open('nltk1.txt', 'w', encoding="utf-8") as h:
    for m in lines:
        if fdist[m] > 0:
            print(m + ':', fdist[m], end=' ', file=h)
I am counting the frequency with which the words from food_low.txt occur in cleaned2.txt. My problem is that food_low.txt contains some bigrams, and those are not counted. How can I make it count the bigrams as well?
Answer 0 (score: 1)
You could try counting the unigrams and bigrams without NLTK, using regular expressions (re). Then you don't need two separate counts; you can handle both with re.findall():
import re
import codecs

# List of words/bigrams and a sentence
l = ['cow', 'dog', 'hot dog', 'pet candy']
s = 'since the opening of the bla and hot dog in the hot dog cow'

# Build your own fdist
fdist = {}
for m in l:
    # Find all occurrences of m in s and store the frequency in fdist[m]
    fdist[m] = len(re.findall(m, s))

# Write the count for each word to a file (if fdist[m] > 0)
with codecs.open('nltk1.txt', 'w', encoding='utf8') as out:
    for m in l:
        if fdist[m] > 0:
            out.write('{}:\t{}\n'.format(m, fdist[m]))
Contents of nltk1.txt:
cow: 1
dog: 2
hot dog: 2
Note: if you want to use NLTK, this answer might fulfill your needs.
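If you prefer a token-based count over the regex approach, one minimal sketch is to count unigrams and space-joined bigrams in a single counter (the sample data here is invented, standing in for the asker's food_low.txt terms and cleaned2.txt text; NLTK's nltk.bigrams would produce the same pairs as the zip call below, and nltk.FreqDist is itself a Counter subclass):

```python
from collections import Counter

# Hypothetical stand-ins for food_low.txt terms and cleaned2.txt text
terms = ['cow', 'dog', 'hot dog', 'pet candy']
tokens = 'since the opening of the bla and hot dog in the hot dog cow'.split()

# Count unigrams, then add space-joined bigrams to the same counter,
# so lookups work for both 'dog' and 'hot dog'
fdist = Counter(tokens)
fdist.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))

for m in terms:
    if fdist[m] > 0:
        print('{}:\t{}'.format(m, fdist[m]))
```

Unlike the regex version, this only matches whole tokens, so 'dog' would not be counted inside a longer word such as 'dogma'.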