Why does Stanford NER fail to recognize all names?

Date: 2018-11-09 10:58:31

Tags: python stanford-nlp

I have written a Python script that uses the Stanford NER tagger. To keep this short, I will show only part of the code. First, I read the file:

lines = []
# utf-8-sig drops a leading byte-order mark if the file has one
with open('dialogue.txt', encoding='utf-8-sig') as infile:
    for line in infile:
        line = line.strip()
        lines.append(line)

Then I expand contractions and remove special characters:

import re

def remove_special_characters(text, remove_digits=False):
    # Keep letters, whitespace and (optionally) digits; note the range is A-Z, not A-z
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
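
One detail in this kind of pattern is easy to get wrong: a range written as `A-z` (lowercase z) covers ASCII 65 to 122, which also includes `[`, `\`, `]`, `^`, `_` and the backtick, so those characters are not stripped. `A-Z` is usually what is meant. A stray `[` or `]` surviving in cleaned text, like the ones at the start and end of the output below, is a typical symptom. Here is a minimal comparison; the sample string is made up for illustration:

```python
import re

# A-z spans ASCII 65..122 and therefore also keeps [ \ ] ^ _ `
buggy_pattern = r'[^a-zA-z0-9\s]'
# A-Z keeps uppercase letters only
fixed_pattern = r'[^a-zA-Z0-9\s]'

sample = "[Hello_world] it's 11:30"  # hypothetical input

buggy_result = re.sub(buggy_pattern, '', sample)
fixed_result = re.sub(fixed_pattern, '', sample)

print(repr(buggy_result))  # brackets and underscore survive
print(repr(fixed_result))  # only letters, digits and whitespace remain
```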

My output so far:

'[If we are all here let us get started First of all I would like you to please join me in welcoming Jack Peterson our Southwest Area Sales Vice President Thank you for having me I am looking forward to todays meeting I would also like to introduce Margaret Simmons who recently joined our team May I also introduce my assistant Bob Hamp Welcome Bob I am afraid our national sales director Anne Trusting cannot be with us today She is in Kobe at the moment developing our Far East sales force  Let us get started We are here today to discuss ways of improving sales in rural market areas First let us go over the report from the last meeting which was held on June 24th Right Tom over to you Thank you Mark Let me just summarize the main points of the last meeting We began the meeting by approving the changes in our sales reporting system discussed on May 30th After briefly revising the changes that will take place we moved on to a brainstorming session concerning after customer support improvements You will find a copy of the main ideas developed and discussed in these sessions in the photocopies posted in front of  you The meeting was declared closed at 1130 Petors is not coming todayprivate reasons  Thank you Tom So if there is nothing else we need to discuss let us move on to todays agenda Have you all received a copy of todays agenda If you do not mind I would like to skip item 1 and move on to item 2 Sales improvement in rural market areas Jack has kindly agreed to give us a report on this matter Jack  ]'

Now I tag the text using

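from itertools import groupby
from nltk.tag.stanford import StanfordNERTagger

# st is a StanfordNERTagger instance (model and jar paths not shown here);
# doc is the cleaned text shown above
r = st.tag(doc.split())
for tag, chunk in groupby(r, lambda x: x[1]):
    if tag != "O":
        print("%-12s" % tag, " ".join(w for w, t in chunk))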
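
To make the grouping step easier to check without running the tagger, here is a small sketch of the same `groupby` logic applied to a hand-written list. The helper name `group_entities` and the sample tokens are hypothetical; the list only mimics the `(token, tag)` pairs that `st.tag()` returns and is not real tagger output.

```python
from itertools import groupby

def group_entities(tagged):
    """Merge runs of consecutive tokens that share a non-"O" tag."""
    entities = []
    for tag, chunk in groupby(tagged, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((tag, " ".join(word for word, _ in chunk)))
    return entities

# Hand-written sample, not produced by Stanford NER
sample = [
    ("introduce", "O"),
    ("Margaret", "PERSON"),
    ("Simmons", "PERSON"),
    ("is", "O"),
    ("in", "O"),
    ("Kobe", "LOCATION"),
]

for tag, phrase in group_entities(sample):
    print("%-12s" % tag, phrase)
```

Because `groupby` only looks at adjacent tags, two different people mentioned back to back (for example "Tom Mark" once the punctuation is gone) would be merged into a single PERSON chunk.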

Final output:

PERSON       Jack
LOCATION     Southwest
PERSON       Margaret Simmons
LOCATION     Kobe
LOCATION     Far East
PERSON       Jack

Not great: it does not recognize names such as Anne, Tom, or Petros. OK, Petros is a Greek name, but I cannot find any explanation for the first two.
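
One possible factor, offered as a guess rather than a verified diagnosis: the text is passed to the tagger as a single unpunctuated token stream, whereas statistical NER models are generally trained on ordinary text with punctuation and sentence boundaries. Below is a rough sketch of tagging sentence by sentence on the raw text instead. The helpers `split_sentences`, `tokenize` and `tag_by_sentence` are hypothetical, and the regex-based splitting is a naive stand-in for a real tokenizer; the only tagger method assumed is `tag_sents`, which NLTK's Stanford taggers provide.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    """Keep words (with internal apostrophes) and punctuation marks as separate tokens."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

def tag_by_sentence(text, tagger):
    """Tag each sentence on its own; tagger must offer tag_sents(list_of_token_lists)."""
    token_lists = [tokenize(s) for s in split_sentences(text)]
    return tagger.tag_sents(token_lists)

# Made-up raw input that still has its punctuation
raw = "May I also introduce my assistant, Bob Hamp. Welcome, Bob."
for sentence in split_sentences(raw):
    print(tokenize(sentence))

# With the real tagger this would be called as:
# tagged_sentences = tag_by_sentence(raw, st)
```

Whether this recovers Anne or Tom would have to be tested on the original, unstripped dialogue.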

0 answers:

No answers yet.