I have the following Python script:
```python
import nltk
from nltk.probability import FreqDist
nltk.download('punkt')

frequencies = {}
book = open('book.txt')
read_book = book.read()
words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)

for w in words:
    frequencies[w] = frequencies[w] + 1

print(frequencies)
```
When I try to run the script, I get the following:
```
[nltk_data] Downloading package punkt to /home/abc/nltk_data...
[nltk_data] Package punkt is already up-to-date!
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    frequencies[w] = frequencies[w] + 1
KeyError: '\\documentclass'
```
What am I doing wrong? And how can I print each word in the text file along with its number of occurrences?

You can download book.txt from here.
Answer 0 (score: 6):
Your `frequencies` dictionary is empty, so you get the `KeyError` on the very first word; that is expected.

I suggest you use `collections.Counter` instead. It is a specialized dictionary (somewhat like a `defaultdict`) designed for counting occurrences.
```python
import collections

import nltk
from nltk.probability import FreqDist

nltk.download('punkt')

frequencies = collections.Counter()
with open('book.txt') as book:
    read_book = book.read()

words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)

for w in words:
    frequencies[w] += 1

print(frequencies)
```
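To see why `Counter` avoids the `KeyError`, here is a minimal, self-contained sketch (the sample words are invented for illustration): missing keys simply read as 0, so the in-place increment always works, even on a word's first occurrence.

```python
import collections

# Counter is a dict subclass whose missing keys read as 0,
# so incrementing a never-seen key raises no KeyError.
counts = collections.Counter()
for word in ["the", "cat", "the"]:
    counts[word] += 1  # works even on the first occurrence

print(counts["the"])  # 2
print(counts["dog"])  # 0: absent key, but no KeyError
```

With a plain `dict`, the same loop would raise `KeyError: 'the'` on the first iteration, which is exactly the failure in the question.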
EDIT: the code above answers your question without really using the `nltk` package; it treats `nltk` as just a string tokenizer. To be more specific, to allow further text analysis without reinventing the wheel, and following the various comments below, you should do this instead:
```python
import nltk
from nltk.probability import FreqDist

nltk.download('punkt')

with open('book.txt') as book:
    read_book = book.read()

words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)  # no need for the loop, does the counting itself
print(frequencyDist)
```
You get (with my own text):

```
<FreqDist with 142 samples and 476 outcomes>
```
So it is not a plain word => count dictionary, but a richer object that carries that information and more:

- `frequencyDist.items()`: gives you the word => count pairs (plus all the classic dict methods)
- `frequencyDist.most_common(50)`: prints the 50 most common words
- `frequencyDist['the']`: returns the number of occurrences of `"the"`
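Since NLTK 3, `FreqDist` is built on top of `collections.Counter`, so those accessors can be sketched with a plain `Counter` over a toy word list, with no corpus download needed (the words here are invented for illustration):

```python
import collections

# Toy stand-in for the tokenized book; FreqDist exposes the
# same Counter interface demonstrated here.
words = ["the", "cat", "sat", "on", "the", "mat"]
dist = collections.Counter(words)

print(dist.most_common(2))  # [('the', 2), ('cat', 1)]
print(dist["the"])          # 2
print(list(dist.items())[:2])  # first two (word, count) pairs
```

`most_common(n)` sorts by descending count (ties keep insertion order), which is why `FreqDist.most_common(50)` gives you the 50 most frequent words directly.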