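I am building a TaggedCorpusReader in NLTK (in an IPython notebook) to read some POS-tagged files from the ANC (http://www.anc.org/). I want to get all the adjectives from the tagged corpus. This is what I tried: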
anc = nltk.corpus.reader.tagged.TaggedCorpusReader(anc_root, r".*\.txt", sep='_')
tagged_words = anc.tagged_words()
anc_adj = {word.lower() for word, pos in tagged_words if pos =='JJ'}
All of the reader methods (tagged_words(), words(), sents(), etc.) work fine. But when I run the set comprehension, I get the following AssertionError:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-70-4ba2a8ab817a> in <module>()
2 tagged_words = anc.tagged_words()
3 print(tagged_words)
----> 4 anc_adj = {word.lower() for word, pos in tagged_words if pos =='JJ'}
<ipython-input-70-4ba2a8ab817a> in <setcomp>(.0)
2 tagged_words = anc.tagged_words()
3 print(tagged_words)
----> 4 anc_adj = {word.lower() for word, pos in tagged_words if pos =='JJ'}
C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
400
401 # Get everything we can from this piece.
--> 402 for tok in piece.iterate_from(max(0, start_tok-offset)):
403 yield tok
404
C:\Program Files\Anaconda3\lib\site-packages\nltk\corpus\reader\util.py in iterate_from(self, start_tok)
299 self.read_block.__name__)
300 num_toks = len(tokens)
--> 301 new_filepos = self._stream.tell()
302 assert new_filepos > filepos, (
303 'block reader %s() should consume at least 1 byte (filepos=%d)' %
C:\Program Files\Anaconda3\lib\site-packages\nltk\data.py in tell(self)
1364 check1 = self._incr_decode(self.stream.read(50))[0]
1365 check2 = ''.join(self.linebuffer)
-> 1366 assert check1.startswith(check2) or check2.startswith(check1)
1367
1368 # Return to our original filepos (so we don't have to throw
AssertionError:
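I have no idea what this means! Can someone help me understand what the problem is here? The same set comprehension works fine on the Brown corpus... so what is going on?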
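
One way to narrow this down, offered as a sketch and not as a confirmed diagnosis: the traceback ends inside the stream reader's tell(), which decodes raw bytes from the file, and NLTK corpus readers default to encoding='utf8'. A file that is not valid UTF-8 is therefore one possible culprit, though nothing in the traceback proves it. The snippet reads word_TAG files with plain Python (bypassing the NLTK reader entirely), collects the words tagged JJ, and records any file that fails to decode. The helper name collect_adjectives and the two sample files are hypothetical and exist only for illustration.

```python
import os
import tempfile


def collect_adjectives(paths, sep="_", tag="JJ", encoding="utf-8"):
    """Collect lowercased words tagged `tag` from word<sep>TAG tokens.

    Returns (adjectives, failures), where failures maps each path that
    could not be decoded to the decoding error message.
    """
    adjectives = set()
    failures = {}
    for path in paths:
        try:
            with open(path, encoding=encoding) as handle:
                text = handle.read()
        except UnicodeDecodeError as err:
            # Remember the file so it can be inspected separately.
            failures[path] = str(err)
            continue
        for token in text.split():
            # Split on the LAST separator, so words containing "_" survive.
            word, found_sep, pos = token.rpartition(sep)
            if found_sep and pos == tag:
                adjectives.add(word.lower())
    return adjectives, failures


# Hypothetical sample data: one clean file, one with a non-UTF-8 byte.
with tempfile.TemporaryDirectory() as root:
    good = os.path.join(root, "good.txt")
    bad = os.path.join(root, "bad.txt")
    with open(good, "wb") as handle:
        handle.write(b"The_DT quick_JJ fox_NN is_VBZ Happy_JJ\n")
    with open(bad, "wb") as handle:
        handle.write(b"caf\xe9_NN big_JJ\r\n")  # 0xE9 is Latin-1, not UTF-8

    adjectives, failures = collect_adjectives([good, bad])
    print(sorted(adjectives))  # ['happy', 'quick']
    print([os.path.basename(p) for p in failures])  # ['bad.txt']
```

If a real ANC file turns up in failures, passing an explicit encoding= argument to TaggedCorpusReader (for example 'latin-1', if that is what the files actually use) may be worth trying. Whether that resolves this particular assertion is an assumption, not something the traceback confirms.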