I have a problem that should actually be easy to solve with NLTK. I found solutions to the problem, but they don't use NLTK:
how can I count the specific bigram words?
Is it possible to do this with an NLTK function?
Here is my code:
import io
import nltk
from nltk.corpus import PlaintextCorpusReader

food = open("food_low.txt")
lines = food.read().split(',')[:-1]

# Assumption: wordlists was not defined in the original snippet;
# a PlaintextCorpusReader over the current directory is one way to get it
wordlists = PlaintextCorpusReader('.', ['cleaned2.txt'])
raw = wordlists.words("cleaned2.txt")
fdist = nltk.FreqDist(w.lower() for w in raw)

with io.open('nltk1.txt', 'w', encoding="utf-8") as h:
    for m in lines:
        if fdist[m] > 0:
            print(m + ':', fdist[m], end=' ', file=h)
I am counting the frequency with which the words from food_low.txt occur in cleaned2.txt. My problem is that food_low.txt contains some bigrams, and those are not counted. How can I make it count the bigrams as well?
Answer 0 (score: 1)
You could try counting the unigrams and bigrams without NLTK, using regular expressions (re). Then you don't need two separate counts; you can handle both with re.findall():
import re
import codecs

# List of words/bigrams and a sentence
l = ['cow', 'dog', 'hot dog', 'pet candy']
s = 'since the opening of the bla and hot dog in the hot dog cow'

# Build your own fdist
fdist = {}
for m in l:
    # Find all occurrences of m in s and store the frequency in fdist[m]
    fdist[m] = len(re.findall(m, s))

# Write the count for each word to a file (if fdist[m] > 0)
with codecs.open('nltk1.txt', 'w', encoding='utf8') as out:
    for m in l:
        if fdist[m] > 0:
            out.write('{}:\t{}\n'.format(m, fdist[m]))
Contents of nltk1.txt:
cow: 1
dog: 2
hot dog: 2
Note: if you want to use NLTK, this answer might fulfill your needs.
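If you prefer a token-based count over the regex approach, one minimal sketch is to count unigrams and space-joined bigrams in a single counter (the sample data here is invented, standing in for the asker's food_low.txt terms and cleaned2.txt text; NLTK's nltk.bigrams would produce the same pairs as the zip call below, and nltk.FreqDist is itself a Counter subclass):

```python
from collections import Counter

# Hypothetical stand-ins for food_low.txt terms and cleaned2.txt text
terms = ['cow', 'dog', 'hot dog', 'pet candy']
tokens = 'since the opening of the bla and hot dog in the hot dog cow'.split()

# Count unigrams, then add space-joined bigrams to the same counter,
# so lookups work for both 'dog' and 'hot dog'
fdist = Counter(tokens)
fdist.update(' '.join(pair) for pair in zip(tokens, tokens[1:]))

for m in terms:
    if fdist[m] > 0:
        print('{}:\t{}'.format(m, fdist[m]))
```

Unlike the regex version, this only matches whole tokens, so 'dog' would not be counted inside a longer word such as 'dogma'.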