KeyError: '\\documentclass'

Date: 2016-10-31 16:37:31

Tags: python nltk

I have the following Python script:

import nltk
from nltk.probability import FreqDist
nltk.download('punkt')

frequencies = {}
book = open('book.txt')
read_book = book.read()
words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)

for w in words:
    frequencies[w] = frequencies[w] + 1 

print (frequencies)

When I try to run the script, I get the following output:

[nltk_data] Downloading package punkt to /home/abc/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    frequencies[w] = frequencies[w] + 1 
KeyError: '\\documentclass'

What am I doing wrong? And how can I print each word in the text file along with its number of occurrences?

You can download book.txt from here.

1 Answer:

Answer 0: (score: 6)

Your frequencies dictionary starts out empty, so you get the KeyError on the very first word: frequencies[w] on the right-hand side looks up a key that has never been stored. That is expected.

I suggest you use collections.Counter instead. It is a specialized dictionary (somewhat like a defaultdict) designed for counting occurrences: looking up a key that is not there yet returns 0 instead of raising KeyError.
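
For example, a minimal illustration of that behaviour (toy key names, not taken from book.txt):

import collections

counts = collections.Counter()
print(counts['missing'])   # 0 -- a missing key yields 0 instead of raising KeyError
counts['word'] += 1        # so incrementing works even on the first occurrence
print(counts['word'])      # 1

Applied to your script, that gives: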

import nltk, collections
from nltk.probability import FreqDist
nltk.download('punkt')

frequencies = collections.Counter()   # missing keys default to 0, so no KeyError
with open('book.txt') as book:
    read_book = book.read()
words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)

for w in words:
    frequencies[w] += 1

print(frequencies)
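
As a side note, a Counter can also be built straight from the token list, which removes the loop entirely and directly answers the "print each word with its count" part of the question; a minimal sketch, assuming the words list from the snippet above:

frequencies = collections.Counter(words)      # counts every token in one pass
for word, count in frequencies.most_common():
    print(word, count)                        # one "word count" line, most frequent first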

EDIT: I answered your question without really using the nltk package; my answer treated nltk as if it were just a string tokenizer. To be more specific, and to allow further text analysis without reinventing the wheel (and thanks to the various comments below), you should do this instead:

import nltk
from nltk.probability import FreqDist
nltk.download('punkt')

with open('book.txt') as book:
    read_book = book.read()
words = nltk.word_tokenize(read_book)
frequencyDist = FreqDist(words)   # no need for the loop, does the count job

print (frequencyDist)

You get (with my text):

<FreqDist with 142 samples and 476 outcomes>

So instead of a direct word => count mapping, you get a more complex object that carries that information plus more (see the short sketch after this list):

  • frequencyDist.items(): gives you the word => count pairs (along with all the classic dict methods)
  • frequencyDist.most_common(50) gives the 50 most common words
  • frequencyDist['the'] returns the number of occurrences of "the"
  • ...
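
For instance, a short sketch of those accessors, assuming the frequencyDist built above (the exact numbers depend on your book.txt):

for word, count in frequencyDist.items():     # word => count pairs, like a regular dict
    print(word, count)

print(frequencyDist.most_common(50))          # the 50 most frequent words as (word, count) tuples
print(frequencyDist['the'])                   # number of occurrences of "the"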