Question

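This happens when a potential NE (named entity) is followed by a comma. For example, if my string is something like

"These names Praveen Kumar,, David Harrison, Paul Harrison, blah"

or

"California, United States"

my output looks like this, respectively.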

[[(u'These', u'O'), (u'names', u'O'), (u'Praveen', u'O'), (u'Kumar,,', u'O'), (u'David', u'PERSON'), (u'Harrison,', u'O'), (u'Paul', u'PERSON'), (u'Harrison,', u'O'), (u'blah', u'O')]]

or

[[(u'California,', u'O'), (u'United', u'LOCATION'), (u'States', u'LOCATION')]]

Why doesn't it recognize potential NEs such as "Praveen Kumar", "Harrison", and "California"?

Here is how I am using it in code:

from nltk.tag.stanford import NERTagger
st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')

tags = st.tag("California, United States".split())

Is it because I am tokenizing the input string with split()? How can I fix this, given that it works fine when run in Java?

Answer 1

Since you are doing this through nltk, use its tokenizers to split up your input:

import nltk

alltext = myfile.read()  # myfile: a text file that is already open
tokenized_text = nltk.word_tokenize(alltext)

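Edit: As the other answer suggests, you are probably better off using the Stanford toolkit's own tokenizer. So if you will be feeding the tokens to one of the Stanford tools, tokenize your text like this to get exactly the tokenization the tools expect: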

from nltk.tokenize.stanford import StanfordTokenizer
tokenize = StanfordTokenizer().tokenize

alltext = myfile.read()
tokenized_text = tokenize(alltext)

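To use this approach you need to have the Stanford tools installed, and nltk must be able to find them. I assume you have already taken care of this, since you are using the Stanford NER tool.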
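As a rough sketch of how the two steps fit together, the tokenizer and the tagger can be passed into a small helper that tokenizes first and tags second. The helper name tag_text and the two stand-in functions are hypothetical and exist only so the example runs without the Stanford jars; they are not part of nltk.

```python
def tag_text(text, tokenize, tag):
    """Tokenize `text`, then hand the resulting token list to the tagger."""
    tokens = tokenize(text)
    return tag(tokens)


# Stand-ins (assumptions) so the sketch runs without the Stanford tools.
def fake_tokenize(text):
    # Crude: only separates commas from the words they follow.
    return text.replace(",", " , ").split()


def fake_tag(tokens):
    # Labels every token 'O'; a real tagger would assign PERSON, LOCATION, etc.
    return [(tok, "O") for tok in tokens]


result = tag_text("California, United States", fake_tokenize, fake_tag)
print(result)
```

With the real tools installed, the call would presumably look like tag_text(alltext, StanfordTokenizer().tokenize, st.tag), where st is the NERTagger instance from the question.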

Answer 2

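The commas need to be separate tokens. Simply using split() does not achieve this: the comma stays attached to the preceding word, so the NER tagger sees the token "California," and fails to recognize it.

If you want behavior similar to what you get with Stanford CoreNLP in Java, I suggest using the nltk wrapper for tokenization: http://www.nltk.org/_modules/nltk/tokenize/stanford.html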
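To make the difference concrete, here is a minimal sketch using only the standard library. The function simple_tokenize is a hypothetical, deliberately crude stand-in that just splits punctuation off from words; it is not the Stanford tokenizer and will not match it on contractions, abbreviations, and similar cases.

```python
import re


def simple_tokenize(text):
    # A run of word characters becomes one token; every other
    # non-space character (such as a comma) becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)


text = "California, United States"

whitespace_tokens = text.split()
separated_tokens = simple_tokenize(text)

print(whitespace_tokens)  # ['California,', 'United', 'States']
print(separated_tokens)   # ['California', ',', 'United', 'States']
```

In the first list the comma is glued to "California", which is the token the tagger was given in the question; in the second the comma stands alone.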

Problem using StanfordNER in Python NLTK to recognize NEs
