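I'm trying to get NLTK to build trigrams across an entire corpus of 12,000 text files and then print the frequency distribution of each trigram to a file, but I get the following error: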
Traceback (most recent call last):
  File "TPNngrams2.py", line 19, in <module>
    fdisttab = fdist.tabulate()
  File "/Library/Python/2.7/site-packages/nltk/probability.py", line 281, in tabulate
    print("%4s" % samples[i], end=' ')
TypeError: not all arguments converted during string formatting
Here is the code:
import nltk
import re
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk import FreqDist
#this imports the text files in the folder into corpus called speeches
corpus_root = '/Users/root'
speeches = PlaintextCorpusReader(corpus_root, '.*\.txt')
print "Finished importing corpus"
fdist = nltk.FreqDist() # Empty distribution
for filename in speeches.fileids():
    (str(trigram) for trigram in nltk.trigrams(speeches.words(filename)))
    fdist.update(nltk.trigrams(speeches.words(filename)))
fdisttab = fdist.tabulate()
print fdisttab
f = open('freqdists.txt', 'w+')
f.write(fdisttab)
f.close()
print "All done. Check file."
Thanks in advance for your help. I don't know where to start in solving this.
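A possible starting point, offered as a sketch rather than a definitive fix: the traceback shows `"%4s" % samples[i]`, and each sample here is a 3-tuple. The `%` operator treats a tuple on the right-hand side as a list of arguments, so one `%4s` placeholder cannot consume three items. Separately, `tabulate()` prints its table and returns `None`, so `fdisttab` would not hold anything that `f.write()` could use. The sketch below is Python 3 and uses only the standard library: `collections.Counter` stands in for `nltk.FreqDist`, and the `trigrams`, `format_counts`, and `documents` names are hypothetical placeholders, not NLTK API. It joins each trigram into a single string before formatting and builds the report text itself instead of relying on `tabulate()`.

```python
import io
from collections import Counter


def trigrams(tokens):
    """Return consecutive 3-token tuples (a stand-in for nltk.trigrams)."""
    tokens = list(tokens)
    return zip(tokens, tokens[1:], tokens[2:])


def format_counts(counts):
    """Build one 'w1 w2 w3<TAB>count' line per trigram, most frequent first."""
    return "\n".join(
        "%s\t%d" % (" ".join(trigram), n)  # join the tuple into ONE string first
        for trigram, n in counts.most_common()
    )


# Hypothetical stand-in for the word lists that speeches.words(filename) yields.
documents = [
    "we the people of the united states".split(),
    "we the people in order to form".split(),
]

# Accumulate trigram counts across all documents.
counts = Counter()
for words in documents:
    counts.update(trigrams(words))

# Reproduce the cause of the error: a tuple is unpacked as format arguments.
try:
    "%4s" % ("we", "the", "people")
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Write the report to any file-like object (an in-memory buffer here).
report = format_counts(counts)
out = io.StringIO()
out.write(report + "\n")
```

To write to disk instead of the in-memory buffer, the same `report` string can be passed to `f.write()` inside `with open('freqdists.txt', 'w') as f:`. With real NLTK objects the same idea should carry over, since each trigram is still a tuple of strings, but that part is untested here.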