Python AssertionError when building nltk.ConditionalFreqDist

Time: 2013-07-08 16:59:24

Tags: python text python-2.7 nltk

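I am getting an error I don't understand when running some Python code. I am trying to learn the Natural Language Toolkit by working through the excellent NLTK textbook. When I ran the code below (a modification of Figure 2.1, applied to my own data), I got the following error.

The code I ran: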

import os, re, csv, string, operator
import nltk
from nltk.corpus import PlaintextCorpusReader
dir = '/Dropbox/hearings'

corpus_root = dir
text = PlaintextCorpusReader(corpus_root, ".*")

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:3])
     for fileid in text.fileids()
     for w in text.words(fileid)
     for target in ['budget','appropriat']
     if w.lower().startswith(target))

cfd.plot()

The error I get (full traceback):

In [6]: ---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-6-abc9ff8cb2f1> in <module>()
----> 1 execfile(r'/Dropbox/hearings/hearings_ingest.py') # PYTHON-MODE

/Dropbox/hearings/hearings_ingest.py in <module>()
     14 cfd = nltk.ConditionalFreqDist(
     15     (target, fileid[:3])
---> 16      for fileid in text.fileids()
     17      for w in text.words(fileid)
     18      for target in ['budget','appropriat']

/Users/ian/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/nltk/probability.pyc in __init__(self, cond_samples)
   1727         defaultdict.__init__(self, FreqDist)
   1728         if cond_samples:
-> 1729             for (cond, sample) in cond_samples:
   1730                 self[cond].inc(sample)
   1731 

/Dropbox/hearings/hearings_ingest.py in <genexpr>((fileid,))
     15     (target, fileid[:3])
     16      for fileid in text.fileids()
---> 17      for w in text.words(fileid)
     18      for target in ['budget','appropriat']
     19      if w.lower().startswith(target))

/Users/ian/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/nltk/corpus/reader/util.pyc in iterate_from(self, start_tok)
    341 
    342         # If we reach this point, then we should know our length.
--> 343         assert self._len is not None
    344 
    345     # Use concat for these, so we can use a ConcatenatedCorpusView

AssertionError: 

In [7]: 

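I have included the new IPython prompt to show that this is the complete error. (In other questions I have seen that "AssertionError:" is often followed by more information. In my case it is blank.)

I would appreciate any help in understanding what is wrong with my code. Thanks!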

1 answer:

Answer 0: (score: 1)

I can reproduce the error by creating an empty file foo and then calling text.words('foo'):

In [18]: !touch 'foo'

In [19]: text = corpus.PlaintextCorpusReader('.', "foo")

In [20]: text.words('foo')
AssertionError:

So, to skip empty files, you could do this:

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:3])
    for fileid in text.fileids()
    # fileids are relative to corpus_root, so join them before checking the size
    if os.path.getsize(os.path.join(corpus_root, fileid)) > 0   # skip empty files
    for w in text.words(fileid)
    for target in ['budget', 'appropriat']
    if w.lower().startswith(target))
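
If the size check is needed in more than one place, it can be pulled out into a small helper. The sketch below is only an illustration: the name nonempty_fileids is hypothetical and not part of NLTK, and it assumes the corpus lives in an ordinary directory on disk, so that each file ID can be joined onto the corpus root to get a real path.

```python
import os


def nonempty_fileids(corpus_root, fileids):
    """Return the file IDs that point at regular, non-empty files.

    Hypothetical helper, not part of NLTK. It assumes each file ID is a
    path relative to corpus_root, as in the question's setup.
    """
    kept = []
    for fileid in fileids:
        # Build the full path first; a bare file ID only resolves when the
        # current working directory happens to be the corpus root.
        path = os.path.join(corpus_root, fileid)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            kept.append(fileid)
    return kept
```

With this in place, the first loop of the generator would read `for fileid in nonempty_fileids(corpus_root, text.fileids())`, and the inline `os.path.getsize` test could be dropped.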