Using NLTK with Python 3

Asked: 2016-08-03 19:27:46

Tags: python count nltk

I have a problem that should actually be easy to solve with NLTK. I found solutions to the problem, but they don't use NLTK:

how can I count the specific bigram words?

Is it possible to do this with NLTK functionality?

Here is my code:

import io
import nltk
from nltk.corpus import PlaintextCorpusReader

food = open("food_low.txt")
lines = food.read().split(',')[:-1]

# wordlists is not defined in the post; a PlaintextCorpusReader over the
# current directory is assumed here
wordlists = PlaintextCorpusReader('.', r'.*\.txt')
raw = wordlists.words("cleaned2.txt")
fdist = nltk.FreqDist(w.lower() for w in raw)

with io.open('nltk1.txt', 'w', encoding="utf-8") as h:
    for m in lines:
        if fdist[m] > 0:
            print(m + ':', fdist[m], end=' ', file=h)

I am counting the frequency of the words from food_low.txt that occur in cleaned2.txt. My problem is that some entries in food_low.txt are bigrams, and those are not counted. How can I make it count the bigrams as well?

1 answer:

Answer 0: (score: 1)

You could try counting the unigrams and bigrams without NLTK, using regular expressions (re). That way you don't need two separate counts; you can do everything in one pass with re.findall():
import re
import codecs

# List of words and a sentence
l = ['cow', 'dog', 'hot dog', 'pet candy']
s = 'since the opening of the bla and hot dog in the hot dog cow'

# Make your own fdist
fdist = {}
for m in l:
    # Find all occurrences of m in l and store the frequency in fdist[m]
    fdist[m] = len(re.findall(m, s))

# Write the wordcounts for each word to a file (if fdist[m] > 0)
with codecs.open('nltk1.txt', 'w', encoding='utf8') as out:
    for m in l:
        if fdist[m] > 0:
            out.write('{}:\t{}\n'.format(m, fdist[m]))

Contents of nltk1.txt:

cow:    1
dog:    2
hot dog:    2
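One caveat with this regex approach: re.findall(m, s) treats each term as a raw pattern and matches substrings, so 'dog' would also be counted inside a word like 'dogs', and a term containing regex metacharacters would misbehave. A hedged variant (the extra 'and dogs' in the sentence is made up here to show the difference) wraps each term with re.escape() and word boundaries:

```python
import re

terms = ['cow', 'dog', 'hot dog']
# 'dogs' appended to the answer's sentence to show the substring problem
s = 'since the opening of the bla and hot dog in the hot dog cow and dogs'

# Raw substring counting, as in the answer above
naive = {m: len(re.findall(m, s)) for m in terms}

# Escaped pattern with \b word boundaries: whole-word/phrase matches only
safe = {m: len(re.findall(r'\b' + re.escape(m) + r'\b', s)) for m in terms}

print(naive['dog'])  # 3 -- also matches inside 'dogs'
print(safe['dog'])   # 2 -- the two standalone 'dog' tokens only
```

For the answer's original sentence both versions give the same counts, so this only matters once the text contains near-matches.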

Note: if you do want to use NLTK, this answer might fulfill your needs.
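For reference, an NLTK-only version of the same count might look like the sketch below. It reuses the answer's sample data rather than the question's files, and builds a single FreqDist over both unigrams and bigrams by joining each n-gram with a space so multi-word terms like 'hot dog' can be looked up directly:

```python
import nltk
from nltk.util import ngrams

terms = ['cow', 'dog', 'hot dog']
tokens = 'since the opening of the bla and hot dog in the hot dog cow'.split()

# One FreqDist over unigrams and bigrams; each n-gram tuple is joined
# with a space so it compares equal to the multi-word terms
fdist = nltk.FreqDist(
    ' '.join(gram) for n in (1, 2) for gram in ngrams(tokens, n)
)

for m in terms:
    if fdist[m] > 0:
        print('{}: {}'.format(m, fdist[m]))
```

For this sentence it prints the same counts as the regex version (cow: 1, dog: 2, hot dog: 2).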