Tabulating and printing frequency distributions with NLTK

Time: 2014-10-03 16:58:33

Tags: python nltk frequency-distribution

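I am trying to get NLTK to tabulate the trigrams across an entire corpus of 12,000 text files and then print the frequency distribution of each trigram to a file, but I get the following error: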

Traceback (most recent call last):
  File "TPNngrams2.py", line 19, in <module>
    fdisttab = fdist.tabulate()
  File "/Library/Python/2.7/site-packages/nltk/probability.py", line 281, in tabulate
     print("%4s" % samples[i], end=' ')
TypeError: not all arguments converted during string formatting

Here is the code:

import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist

#this imports the text files in the folder into corpus called speeches
corpus_root = '/Users/root'
speeches = PlaintextCorpusReader(corpus_root, '.*\.txt')

print "Finished importing corpus"
fdist = nltk.FreqDist()  # Empty distribution

for filename in speeches.fileids():
    (str(trigram) for trigram in nltk.trigrams(speeches.words(filename)))
    fdist.update(nltk.trigrams(speeches.words(filename)))

fdisttab = fdist.tabulate()
print fdisttab
f = open('freqdists.txt', 'w+')
f.write(fdisttab)
f.close()

print "All done. Check file."

Thanks in advance for your help. I don't know where to start with this problem.
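
A likely cause, offered as a sketch rather than a confirmed answer: each trigram is a tuple, and the `"%4s" % samples[i]` call shown in the traceback treats a tuple on the right-hand side as a list of format arguments, so a 3-tuple against a single `%4s` raises this TypeError. Two other things in the script look suspect. The generator expression on its own line builds strings that are never used, and `tabulate()` prints its table and returns None, so there is nothing to hand to `f.write`. The example below uses only the standard library: `collections.Counter` stands in for `nltk.FreqDist` (which subclasses Counter in NLTK 3), and the helper names `trigrams`, `count_trigrams`, `format_table` and the toy `docs` list are assumptions made up for illustration.

```python
from collections import Counter


def trigrams(tokens):
    """Yield consecutive 3-token tuples (a stand-in for nltk.trigrams)."""
    return zip(tokens, tokens[1:], tokens[2:])


def count_trigrams(documents):
    """Count trigrams over an iterable of token lists.

    Each trigram is joined into one string, so the keys are plain strings
    rather than tuples and are safe to use with "%s" formatting.
    """
    counts = Counter()  # assumption: nltk.FreqDist could be used the same way
    for tokens in documents:
        counts.update(" ".join(gram) for gram in trigrams(tokens))
    return counts


def format_table(counts):
    """Return one 'trigram<TAB>count' line per trigram, most frequent first."""
    return "\n".join("%s\t%d" % (gram, n) for gram, n in counts.most_common())


# Hypothetical toy corpus standing in for speeches.words(filename)
docs = [
    ["we", "the", "people", "of", "the", "united", "states"],
    ["we", "the", "people", "are", "here"],
]

counts = count_trigrams(docs)
table = format_table(counts)
print(table)
```

Because `table` is an ordinary string, it can be written out with `f.write(table)` in place of the return value of `tabulate()`. In the original script, the same idea would mean passing the joined strings (not the raw tuples) to `fdist.update`.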

0 answers:

No answers