I'm using the Stanford NLP Python wrapper. The code to find named entities is:
sentence = "Mr. Jhon was noted to have a cyst at his visit back in 2011."
result = nlp.ner(sentence)
for ne in result:
    if ne[1] == 'PERSON':
        print(ne)
The output is a list of tuples, e.g. (u'Jhon', u'PERSON')
But it does not give the indices (character offsets) of the named entities, the way spaCy or other NLP tools do when they return indexed results, for example:
>> namefinder = NameFinder.getNameFinder("spaCy")
>> entities = namefinder.find(sentences)
List(List((PERSON,0,13), (DURATION,15,27), (DATE,76,83)),
List((PERSON,4,10), (LOCATION,77,86), (ORGANIZATION,26,39)),
List((PERSON,0,13), (DURATION,16,28), (ORGANIZATION,52,80)))
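For comparison, spaCy exposes character offsets directly on each entity via start_char and end_char. A minimal sketch, assuming the en_core_web_sm model is installed (not part of the original code):

import spacy

nlp_spacy = spacy.load("en_core_web_sm")
doc = nlp_spacy("Mr. Jhon was noted to have a cyst at his visit back in 2011.")
for ent in doc.ents:
    # ent.start_char / ent.end_char are character offsets into the original string
    print(ent.text, ent.label_, ent.start_char, ent.end_char)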
Answer 0 (score: 0)
I'm using nltk. I adapted the answer from here. The key point is to use WordPunctTokenizer and its span_tokenize() method to build a separate list, which I call spans, that holds the character span of each token.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import WordPunctTokenizer
# Initialize Stanford NLP with the path to the model and the NER .jar
st = StanfordNERTagger(r"C:\stanford-corenlp\stanford-ner\classifiers\english.all.3class.distsim.crf.ser.gz",
r"C:\stanford-corenlp\stanford-ner\stanford-ner.jar",
encoding='utf-8')
sentence = "Mr. Jhon was noted to have a cyst at his visit back in 2011."
tokens = WordPunctTokenizer().tokenize(sentence)
# We have to compute the token spans in a separate list
# Notice that span_tokenize(sentence) returns a generator
spans = list(WordPunctTokenizer().span_tokenize(sentence))
# enumerate will help us keep track of the token index in the token lists
for i, ner in enumerate(st.tag(tokens)):
    if ner[1] == "PERSON":
        # spans[i] is the (start, end) character offset of the i-th token
        print(spans[i], ner)
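On the sample sentence this should print something like (4, 8) ('Jhon', 'PERSON'), though the exact tags depend on the classifier. If you want (label, start, end) tuples similar to the format shown in the question, a minimal sketch built on the same spans list (note it emits one tuple per token, not per merged multi-token entity):

entities = []
for i, (token, label) in enumerate(st.tag(tokens)):
    if label != "O":              # "O" marks tokens outside any named entity
        start, end = spans[i]     # character offsets from span_tokenize()
        entities.append((label, start, end))
print(entities)                   # e.g. [('PERSON', 4, 8)] with the 3-class model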