I am trying to chunk the sentences in my corpus. I first load my tagged data and then try to run the chunker over the tagged corpus. Here is my code.
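import os
from collections import defaultdict

import nltk


def load_corpus():
    # Read every file under the dump directory as a tagged corpus.
    corpus_root = os.path.abspath('../nlp1/dumpfiles')
    mycorpus = nltk.corpus.reader.TaggedCorpusReader(corpus_root, '.*')
    return mycorpus.tagged_sents()


def sents_chunks(tagg_sents, pos_tag_pattern):
    # Count how often each NP chunk occurs across all tagged sentences.
    chunk_freq_dict = defaultdict(int)
    chunker = nltk.RegexpParser(pos_tag_pattern)
    for sent in tagg_sents:
        if not all(sent):
            print("NoneType object in \"{}\": {}".format(sent.label(), sent))
            # cast_to_tree_function is the placeholder name used in the answer below; it is not defined here.
            sent = cast_to_tree_function(filter(bool, sent))
        for chk in chunker.parse(sent).subtrees():
            if str(chk).startswith('(NP'):
                # Strip the leading "(NP " and the trailing ")".
                phrase = str(chk)[4:-1]
                #print(phrase)
                if '\n' in phrase:
                    # Collapse the line breaks that appear in long chunks.
                    phrase = ' '.join(phrase.split())
                    #print(phrase)
                chunk_freq_dict[phrase] += 1
    #print(chunk_freq_dict)
    return chunk_freq_dict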
I get an error somewhere in my corpus, and I do not know where or why. Does anyone know what the problem is and how I can fix it? This is the error:
Traceback (most recent call last):
  File "multiwords1.py", line 184, in <module>
    candidates = main(domain_corpus, PATTERN,MIN_FREQ,MIN_CVAL)
  File "multiwords1.py", line 156, in main
    chunks_freqs = sents_chunks(domain_sents, pos_tag_pattern)
  File "multiwords1.py", line 23, in sents_chunks
    for chk in chunker.parse(sent).subtrees():
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 1208, in parse
    chunk_struct = parser.parse(chunk_struct, trace=trace)
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 1023, in parse
    chunkstr = ChunkString(chunk_struct)
  File "/usr/local/lib/python3.5/dist-packages/nltk/chunk/regexp.py", line 98, in __init__
    self._str = '<' + '><'.join(tags) + '>'
TypeError: sequence item 352: expected str instance, NoneType found
Answer 0 (score: 0)
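You are getting a TypeError. The message says that item 352 of tags is None (NoneType), which means that sent (an nltk.tree.Tree instance) contains a None object.
This line (regexp.py, line 98 in the traceback) is the reason for the exception, because str.join only accepts str items. You need to check that every item of the sent iterable ends up with a str tag.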
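To make the failure concrete, here is a minimal sketch that reproduces the same TypeError with plain Python, without NLTK. The tags list is made-up sample data standing in for the tags that ChunkString collects; only the join expression is taken from the traceback.

```python
# Made-up sample data: one tag is None, as happens for an untagged token.
tags = ["DT", "NN", None, "VBZ"]

try:
    # Same expression as regexp.py line 98 in the traceback.
    chunk_str = "<" + "><".join(tags) + ">"
except TypeError as exc:
    message = str(exc)
    print(message)

# The index in the message is the position of the offending tag in the sentence.
bad_positions = [i for i, tag in enumerate(tags) if tag is None]
print(bad_positions)
```

The "sequence item N" number in the error is therefore an index into a single sentence, not a line number in the corpus files.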
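You can use the built-in filter function, but the result has to be converted back to a Tree.
filter(bool, sent)  # returns an iterator over the truthy items
To check whether the iterable contains a None item, you can do something like this:
if not all(sent):
    print("NoneType object in \"{}\": {}".format(sent.label(), sent))
    # cast_to_tree_function is a placeholder for your own conversion back to a Tree
    sent = cast_to_tree_function(filter(bool, sent))  # replace sent with only the valid items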
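One caveat, offered as a hedged sketch rather than a definitive fix: if sent is a plain list of (word, tag) tuples, which is my understanding of what TaggedCorpusReader.tagged_sents() yields, then a token such as ("oops", None) is itself truthy, so all(sent) and filter(bool, sent) would not catch it. As far as I know, NLTK assigns a None tag when a token in the file has no word/tag separator. The helper names find_untagged and drop_untagged and the sample data below are hypothetical; the code uses no NLTK calls and checks the tag explicitly.

```python
def find_untagged(tagged_sents):
    """Yield (sentence_index, token_index, word) for every token whose tag is None."""
    for s_idx, sent in enumerate(tagged_sents):
        for t_idx, (word, tag) in enumerate(sent):
            if tag is None:
                yield s_idx, t_idx, word


def drop_untagged(sent):
    """Return a copy of the sentence without the tokens that have no tag."""
    return [(word, tag) for word, tag in sent if tag is not None]


# Made-up sample standing in for the output of tagged_sents().
sample = [
    [("the", "DT"), ("cat", "NN")],
    [("sat", "VBD"), ("oops", None), ("down", "RP")],
]

problems = list(find_untagged(sample))
cleaned = [drop_untagged(sent) for sent in sample]
print(problems)
print(cleaned[1])
```

Running find_untagged over the real corpus first should show which words lost their tags, which is usually a sign of a malformed token in the source files. Whether to drop those tokens or repair the files is a judgment call; dropping them changes which chunks the parser can find.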