NLTK pos_tag does not work when tagging names such as "Follett" and "Rebecca" individually: it tags them as NN instead of NNP. Interestingly, if I tag them as a list, it works fine.
I am running NER on an ebook (an .epub converted to .txt with Calibre). I collect all the words in the book (except stopwords) and then run pos_tag on this list.
The converted file is encoded as UTF-8, but I made sure Calibre transliterated every character that ASCII does not support.
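To check whether the tagging was failing because of an encoding problem, I copied the text into Python IDLE and ran pos_tag on it there. I also typed the names in by hand, and it still did not work. In both cases the result was NN.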
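The comparison described above can be sketched as a small helper that tags each token on its own and then again as part of the whole list. This is only an illustration: `compare_isolated_vs_context` and `demo_with_nltk` are hypothetical names, not part of NLTK, and the demo assumes NLTK and its tagger model data are installed. A tagger that uses surrounding words as features has less evidence for a lone token, which may be part of what is going on here.

```python
from typing import Callable, Dict, List, Tuple

# Any function that maps a token list to (token, tag) pairs, e.g. nltk.pos_tag
Tagger = Callable[[List[str]], List[Tuple[str, str]]]


def compare_isolated_vs_context(tag: Tagger, tokens: List[str]) -> Dict[str, Tuple[str, str]]:
    """Return {token: (tag when tagged alone, tag when tagged within the full list)}.

    Hypothetical helper for illustration; duplicate tokens collapse to one entry.
    """
    in_context = dict(tag(tokens))
    return {tok: (tag([tok])[0][1], in_context[tok]) for tok in tokens}


def demo_with_nltk() -> None:
    """Run the comparison with NLTK's default tagger (needs NLTK and its model data)."""
    import nltk  # imported here so the helper above works without NLTK installed

    print(compare_isolated_vs_context(nltk.pos_tag, ["Follett", "Rebecca"]))
```

Calling `demo_with_nltk()` prints both tags for each name side by side, which makes it easy to see whether the isolated and in-list results differ.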
import nltk

def process_file(_file, tagger, stemmer, stopwords, filename, printinfo):
    sentences = []
    _nnp = set()    # tokens tagged as proper nouns (NNP / NNPS)
    words = dict()  # stem -> count for every other non-stopword token
    # Split each line into sentences, then each sentence into tokens
    for line in _file:
        for sentence in nltk.tokenize.sent_tokenize(line):
            sentences.append(nltk.tokenize.word_tokenize(sentence))
    sent_count = 0
    for sentence in sentences:
        sent_count += 1
        # Tag one tokenized sentence at a time
        tags = tagger.tag(sentence)
        for tag in tags:
            if tag[1] == "NNP" or tag[1] == "NNPS":
                _nnp.add(tag[0])
            else:
                if tag[0] not in stopwords:
                    stemmed_word = stemmer.stem(tag[0])
                    if stemmed_word not in words.keys():
                        words[stemmed_word] = 1
                    else:
                        words[stemmed_word] += 1
        # Progress indicator, rewritten in place on the same console line
        print("\r[{0}] Reading file '{1}'[{2:>3.1%}] ".format(printinfo, filename, sent_count/len(sentences)), end='')
    return _nnp, words
Main code:
import time

# `files` and `start` are defined earlier in the script (not shown here)
dictionary = dict()
nnp = set()
# Initialize tagger and stemmer
pos_tagger = nltk.tag.PerceptronTagger()
ps = nltk.PorterStemmer()
stopwords = nltk.corpus.stopwords.words('english')
file_count = 0
# For each file, extract words and named entities
for file in files:
    file_count += 1
    # Open with a context manager so the file handle is closed after processing
    with open(file, 'r', encoding='utf-8') as f:
        _nnp, _dictionary = process_file(f, pos_tagger, ps, stopwords, file, str(file_count)+"/"+str(len(files)))
    # Extend dictionary
    for word in _dictionary.keys():
        if word in dictionary.keys():
            dictionary[word] += _dictionary[word]
        else:
            dictionary[word] = _dictionary[word]
    # Join sets
    nnp = nnp.union(_nnp)
end = time.time()
print("COMPLETED: Step 2 completed in {0:.3f}s".format(end-start))
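As an aside, the "Extend dictionary" step could probably be written more compactly with `collections.Counter` from the standard library. The sketch below is not the code I am running; `merge_counts` is a hypothetical helper and the sample counts are made up for illustration.

```python
from collections import Counter


def merge_counts(per_file_counts):
    """Sum several {stem: count} dictionaries into a single Counter."""
    total = Counter()
    for counts in per_file_counts:
        # Counter.update adds counts for existing keys instead of replacing them
        total.update(counts)
    return total


# Made-up per-file counts, standing in for the dictionaries returned per file
merged = merge_counts([{"fax": 2, "rotat": 1}, {"fax": 1, "brand": 1}])
print(merged)
```

The resulting `Counter` behaves like the plain `dict` used above, so the rest of the script would not need to change.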
Here is a sample of the text:
In slow motion, afraid of what he was about to witness, Langdon rotated the fax 180 degrees. He looked at the word upside down.
Instantly, the breath went out of him. It was like he had been hit by a truck. Barely able to believe his eyes, he rotated the fax again, reading the brand right-side up and then upside down.
"Illuminati," he whispered.