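I am using the Stanford Named Entity Recognizer with Python to find proper names in the novel "One Hundred Years of Solitude". Many of them consist of a first name and a surname, e.g. "Aureliano Buendía" or "Santa Sofía de la Piedad". Because of the tokenizer I am using, these always end up as separate tokens, e.g. "Aureliano" and "Buendía". I would like to keep them together as a single token so that Stanford NER tags the whole name as "PERSON".
The code I wrote: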
import nltk
from nltk.tag import StanfordNERTagger
from nltk import word_tokenize
from nltk import FreqDist

sentence1 = open('book1.txt').read()
sentence = sentence1.split()  # whitespace tokenization: every word becomes its own token
# Raw strings keep the backslashes in the Windows paths literal.
path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"
st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

def findtags(tagged_text, tag_prefix):
    # Count how often each word occurs under tags ending in tag_prefix.
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.endswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(1000)) for tag in cfd.conditions())

print(findtags(taggedSentence, 'PERSON'))
The result looks like this:
{'PERSON':[('Aureliano',397),('José',294),('Arcadio',286),('Buendía',251),......
Does anyone have a solution? I would be very grateful.
Answer 0: (score: 0)
import nltk
from nltk.tag import StanfordNERTagger

sentence1 = open('book1.txt').read()
sentence = sentence1.split()
# Raw strings keep the backslashes in the Windows paths literal.
path_to_model = r"C:\Python34\stanford-ner-2015-04-20\classifiers\english.muc.7class.distsim.crf.ser"
path_to_jar = r"C:\Python34\stanford-ner-2015-04-20\stanford-ner.jar"
st = StanfordNERTagger(model_filename=path_to_model, path_to_jar=path_to_jar)
taggedSentence = st.tag(sentence)

data = 'book1.txt'  # assumption: `data` was undefined in the original and is taken to be the input file name
test = []
test_dict = {}
for element in range(len(taggedSentence)):
    a = ''
    # The list shrinks as tokens are popped, so check the index before every access.
    while element < len(taggedSentence) and taggedSentence[element][1] == 'PERSON':
        a += taggedSentence[element][0] + ' '
        taggedSentence.pop(element)
    if len(a) > 1:
        # One run of adjacent PERSON tokens becomes one full name.
        test.append(a.strip())
test_dict[data.split('.')[0]] = tuple(test)
print(test_dict)
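
The same idea, joining runs of adjacent PERSON tokens into one name, can probably be written more compactly with itertools.groupby, and the merged names can then be counted the way the question does. The following is only a sketch: merge_entities is a hypothetical helper, and sample is a hand-made list in the (word, tag) format that st.tag() returns, not real tagger output.

```python
from collections import Counter
from itertools import groupby


def merge_entities(tagged_tokens, target_tag="PERSON"):
    """Join runs of adjacent tokens tagged target_tag into single strings."""
    merged = []
    # groupby yields one group per run of consecutive tokens sharing a tag.
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == target_tag:
            merged.append(" ".join(word for word, _ in group))
    return merged


# Hand-made sample standing in for the output of st.tag(sentence).
sample = [
    ("Aureliano", "PERSON"), ("Buendía", "PERSON"), ("met", "O"),
    ("Úrsula", "PERSON"), ("and", "O"),
    ("Aureliano", "PERSON"), ("Buendía", "PERSON"), ("again", "O"),
]

names = merge_entities(sample)
name_counts = Counter(names)
print(name_counts.most_common())
# → [('Aureliano Buendía', 2), ('Úrsula', 1)]
```

Unlike the loop above, this does not modify the tagged list. It shares the same limitations, though: two different people mentioned back to back would be merged into one name, and a name such as "Santa Sofía de la Piedad" would be split if the tagger labels "de" or "la" as something other than PERSON.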