How do I call NER from NLTK to get results for the first two hundred characters of every TXT file in the same directory?
When I try this code:
for filename in os.listdir(ebooksFolder):
    fname, fextension = os.path.splitext(filename)
    if (fextension == '.txt'):
        newName = 'ner_' + filename
        file = open(ebooksFolder + '\\' + filename)
        rawFile = file.read()
        partToUse = rawFile[:50]
        segmentedSentences = nltk.sent_tokenize(partToUse)
        tokenizedSentences = [nltk.word_tokenize(sent) for sent in segmentedSentences]
        posTaggedSentences = [nltk.pos_tag(sent) for sent in tokenizedSentences]
        nerResult = nltk.ne_chunk(posTaggedSentences)
        pathToCopy = 'C:\\Users\\Felipe\\Desktop\\books_txt\\'
        nameToSave = os.path.join(pathToCopy, newName + '.txt')
        newFile = open(nameToSave, 'w')
        newFile.write(nerResult)
        newFile.close()
I get this error:
Traceback (most recent call last):
  File "<pyshell#77>", line 11, in <module>
    nerResult = nltk.ne_chunk(posTaggedSentences)
  File "C:\Python27\lib\site-packages\nltk\chunk\__init__.py", line 177, in ne_chunk
    return chunker.parse(tagged_tokens)
  File "C:\Python27\lib\site-packages\nltk\chunk\named_entity.py", line 116, in parse
    tagged = self._tagger.tag(tokens)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 58, in tag
    tags.append(self.tag_one(tokens, i, tags))
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 78, in tag_one
    tag = tagger.choose_tag(tokens, index, history)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 554, in choose_tag
    featureset = self.feature_detector(tokens, index, history)
  File "C:\Python27\lib\site-packages\nltk\tag\sequential.py", line 605, in feature_detector
    return self._feature_detector(tokens, index, history)
  File "C:\Python27\lib\site-packages\nltk\chunk\named_entity.py", line 49, in _feature_detector
    pos = simplify_pos(tokens[index][1])
  File "C:\Python27\lib\site-packages\nltk\chunk\named_entity.py", line 178, in simplify_pos
    if s.startswith('V'): return "V"
AttributeError: 'tuple' object has no attribute 'startswith'
Answer 0 (score: 5)
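After you split the text into sentences and POS-tag each one, posTaggedSentences is a list of tagged sentences. nltk.ne_chunk expects a single tagged sentence (a list of (word, tag) tuples), so when it is given the whole list it treats each sentence as a token and finds a tuple where it expects a tag string. You need to iterate over the list of tagged sentences, like this: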
nerResult = [nltk.ne_chunk(pts) for pts in posTaggedSentences]
instead of this:
nerResult = nltk.ne_chunk(posTaggedSentences)
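
For completeness, here is a rough sketch of how the whole loop from the question might look with that fix applied. It is written for Python 3 and assumes the NLTK data packages needed by the tokenizer, tagger and NE chunker are already downloaded. The names ner_trees, write_ner_files, analyze and n_chars are made up for this example and are not part of NLTK. Note that once the chunking works, newFile.write(nerResult) would fail as well, because write() needs a string, so the sketch converts each tree with str().

```python
import os


def ner_trees(text):
    """Return one NLTK chunk tree per sentence found in `text`."""
    # Imported here so the file handling below can be tried without NLTK installed.
    import nltk

    sentences = nltk.sent_tokenize(text)
    tokenized = [nltk.word_tokenize(sent) for sent in sentences]
    tagged = [nltk.pos_tag(tokens) for tokens in tokenized]
    # ne_chunk takes ONE tagged sentence, so call it once per sentence.
    return [nltk.ne_chunk(sent) for sent in tagged]


def write_ner_files(src_dir, dst_dir, analyze=ner_trees, n_chars=200):
    """Write 'ner_<name>.txt' into dst_dir for every .txt file in src_dir."""
    written = []
    for filename in sorted(os.listdir(src_dir)):
        base, ext = os.path.splitext(filename)
        if ext != '.txt':
            continue
        # Only the first n_chars characters of each file are analysed.
        with open(os.path.join(src_dir, filename)) as src:
            part = src.read()[:n_chars]
        results = analyze(part)
        # filename already ends in '.txt', so no second extension is appended.
        out_path = os.path.join(dst_dir, 'ner_' + filename)
        with open(out_path, 'w') as dst:
            # write() needs a string, so convert each result with str().
            dst.write('\n'.join(str(r) for r in results))
        written.append(out_path)
    return written
```

With the paths from the question, the call would be something like write_ner_files(ebooksFolder, 'C:\\Users\\Felipe\\Desktop\\books_txt\\'). The analyze parameter is only there so the file handling can be checked with a stand-in function; leave it at its default to run the real NLTK pipeline.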