I have the following Python script:
```python
import nltk
from nltk.probability import FreqDist
nltk.download('punkt')

frequencies = {}
book = open('book.txt')
read_book = book.read()
words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)

for w in words:
    frequencies[w] = frequencies[w] + 1

print(frequencies)
```
When I try to run the script, I get the following:
```
[nltk_data] Downloading package punkt to /home/abc/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    frequencies[w] = frequencies[w] + 1
KeyError: '\\documentclass'
```
What am I doing wrong? And how can I print each word in the text file along with its number of occurrences?

You can download book.txt from here.
Answer 0 (score: 6):
Your `frequencies` dictionary is empty, so you get the `KeyError` on the very first word; that is expected.

I suggest you use `collections.Counter` instead. It is a specialized dictionary (somewhat like a `defaultdict`) designed for counting occurrences.
```python
import collections

import nltk
from nltk.probability import FreqDist

nltk.download('punkt')

frequencies = collections.Counter()
with open('book.txt') as book:
    read_book = book.read()

words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)

for w in words:
    frequencies[w] += 1

print(frequencies)
```
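To see why `Counter` avoids the `KeyError`, here is a minimal, self-contained sketch (the sample words are invented for illustration): missing keys simply read as 0, so the in-place increment always works, even on a word's first occurrence.

```python
import collections

# Counter is a dict subclass whose missing keys read as 0,
# so incrementing a never-seen key raises no KeyError.
counts = collections.Counter()
for word in ["the", "cat", "the"]:
    counts[word] += 1  # works even on the first occurrence

print(counts["the"])  # 2
print(counts["dog"])  # 0: absent key, but no KeyError
```

With a plain `dict`, the same loop would raise `KeyError: 'the'` on the first iteration, which is exactly the failure in the question.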
EDIT: the code above answers your question without really using the `nltk` package; it treats `nltk` as just a string tokenizer. To be more specific, to allow further text analysis without reinventing the wheel, and following the various comments below, you should do this instead:
```python
import nltk
from nltk.probability import FreqDist

nltk.download('punkt')

with open('book.txt') as book:
    read_book = book.read()

words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)  # no need for the loop, does the counting itself
print(frequencyDist)
```
You get (with my own text):

```
<FreqDist with 142 samples and 476 outcomes>
```
So it is not a plain word => count dictionary, but a richer object that carries that information and more:

- `frequencyDist.items()`: gives you the word => count pairs (plus all the classic dict methods)
- `frequencyDist.most_common(50)`: prints the 50 most common words
- `frequencyDist['the']`: returns the number of occurrences of `"the"`
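Since NLTK 3, `FreqDist` is built on top of `collections.Counter`, so those accessors can be sketched with a plain `Counter` over a toy word list, with no corpus download needed (the words here are invented for illustration):

```python
import collections

# Toy stand-in for the tokenized book; FreqDist exposes the
# same Counter interface demonstrated here.
words = ["the", "cat", "sat", "on", "the", "mat"]
dist = collections.Counter(words)

print(dist.most_common(2))  # [('the', 2), ('cat', 1)]
print(dist["the"])          # 2
print(list(dist.items())[:2])  # first two (word, count) pairs
```

`most_common(n)` sorts by descending count (ties keep insertion order), which is why `FreqDist.most_common(50)` gives you the 50 most frequent words directly.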