I am trying to write a text preprocessor and get nltk.ne_chunk() to work, but I am getting a lot of errors with the following code:
import nltk

z = "Francois Legault of the CAQ will now become the new premier of Quebec. This is possible as his party defeated the Liberals in the Provincial elections held on October 1st 2018."

def preprocess_pipe1(doc1):
    # Step 1: split the document into sentences
    sent1 = nltk.sent_tokenize(doc1)
    #print(sent1)
    print(" ")
    print("SENTENCE SPLITTER")
    for x in sent1:
        print(x)
    print(" ")
    # Step 2: tokenize each sentence into words
    sent1 = [nltk.word_tokenize(sent2) for sent2 in sent1]
    #print(sent1)
    print(" ")
    print("TOKENIZER")
    for x in sent1:
        print(x)
    print(" ")
    # Step 3: POS-tag each tokenized sentence
    sent1 = [nltk.pos_tag(sent2) for sent2 in sent1]
    #print(sent1)
    print(" ")
    print("POS TAGGER")
    for x in sent1:
        print(x)
    return sent1

sent2=preprocess_pipe1(z)
sent3=nltk.ne_chunk(sent2)
print(sent3)
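The output and the error are as follows:
SENTENCE SPLITTER
Francois Legault of the CAQ will now become the new premier of Quebec.
This is possible as his party defeated the Liberals in the Provincial elections held on October 1st 2018.
TOKENIZER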
['Francois', 'Legault', 'of', 'the', 'CAQ', 'will', 'now', 'become', 'the', 'new', 'premier', 'of', 'Quebec', '.']
['This', 'is', 'possible', 'as', 'his', 'party', 'defeated', 'the', 'Liberals', 'in', 'the', 'Provincial', 'elections', 'held', 'on', 'October', '1st', '2018', '.']
POS TAGGER
[('Francois', 'NNP'), ('Legault', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('CAQ', 'NNP'), ('will', 'MD'), ('now', 'RB'), ('become', 'VB'), ('the', 'DT'), ('new', 'JJ'), ('premier', 'NN'), ('of', 'IN'), ('Quebec', 'NNP'), ('.', '.')]
[('This', 'DT'), ('is', 'VBZ'), ('possible', 'JJ'), ('as', 'IN'), ('his', 'PRP$'), ('party', 'NN'), ('defeated', 'VBD'), ('the', 'DT'), ('Liberals', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('Provincial', 'NNP'), ('elections', 'NNS'), ('held', 'VBD'), ('on', 'IN'), ('October', 'NNP'), ('1st', 'CD'), ('2018', 'CD'), ('.', '.')]
Error:
Traceback (most recent call last):
  File "C:/Users/Robin Karlose/PycharmProjects/NLTK Test 1/Code 5-NER test.py", line 71, in <module>
    sent3=nltk.ne_chunk(sent2)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\chunk\__init__.py", line 177, in ne_chunk
    return chunker.parse(tagged_tokens)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\chunk\named_entity.py", line 123, in parse
    tagged = self._tagger.tag(tokens)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\tag\sequential.py", line 63, in tag
    tags.append(self.tag_one(tokens, i, tags))
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\tag\sequential.py", line 83, in tag_one
    tag = tagger.choose_tag(tokens, index, history)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\tag\sequential.py", line 632, in choose_tag
    featureset = self.feature_detector(tokens, index, history)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\tag\sequential.py", line 680, in feature_detector
    return self._feature_detector(tokens, index, history)
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\chunk\named_entity.py", line 56, in _feature_detector
    pos = simple_pos(tokens[index][1])
  File "C:\Users\Robin Karlose\PycharmProjects\NLTK Test 1\venv\lib\site-packages\nltk\chunk\named_entity.py", line 186, in simple_pos
    if s.startswith('V'): return "V"
AttributeError: 'tuple' object has no attribute 'startswith'
Interestingly, when I run the following code, NER works fine:
import nltk
import nltk.corpus
sent = nltk.corpus.treebank.tagged_sents()[22]
print(sent)
print(nltk.ne_chunk(sent))
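As far as I can tell, in both cases I am passing POS-tagged text to NLTK's named entity recognition function (i.e. nltk.ne_chunk()), but for the life of me I cannot understand why the first case produces so many errors.
If anyone could offer some insight into this, I would greatly appreciate it!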
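One thing that can be checked without running the chunker is how the two inputs are nested. The sketch below is only an illustration under stated assumptions: `tagged_shape` and `chunk_each_sentence` are hypothetical helpers (not part of NLTK), and the sample data is a shortened copy of the POS TAGGER output above. It suggests that preprocess_pipe1() returns a list of tagged sentences, whereas treebank.tagged_sents()[22] is a single tagged sentence, which would explain why `tokens[index][1]` in the traceback is a tuple rather than a tag string.

```python
# Hypothetical helper (not part of NLTK): reports how a tagged structure is nested.
def tagged_shape(data):
    first = data[0]
    # A single tagged sentence is a list of (word, tag) tuples.
    if isinstance(first, tuple) and isinstance(first[1], str):
        return "one tagged sentence"
    # A tagged document is a list of such sentences.
    if isinstance(first, list) and first and isinstance(first[0], tuple):
        return "list of tagged sentences"
    return "unknown"

# Shortened copy of the POS TAGGER output shown in the question.
pipeline_output = [
    [("Francois", "NNP"), ("Legault", "NNP"), ("of", "IN"), ("the", "DT"), ("CAQ", "NNP")],
    [("This", "DT"), ("is", "VBZ"), ("possible", "JJ")],
]
# Same nesting as treebank.tagged_sents()[22]: one sentence only.
single_sentence = pipeline_output[0]

print(tagged_shape(pipeline_output))
print(tagged_shape(single_sentence))

def chunk_each_sentence(tagged_sents):
    """Assumption: calling nltk.ne_chunk once per sentence avoids the nested input."""
    import nltk  # imported lazily so the shape check above runs without NLTK installed
    return [nltk.ne_chunk(sent) for sent in tagged_sents]
```

If that reading is right, something like `chunk_each_sentence(sent2)` (or NLTK's `nltk.ne_chunk_sents`, which takes a list of tagged sentences) would be the call to try in place of `nltk.ne_chunk(sent2)`; it needs the NLTK chunker models to be downloaded.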