Question

I want to use StanfordNER in Python to detect named entities. How should I clean up the sentences?

For example, consider

qry="In the UK, the class is relatively crowded with Zacc competing with Abc's Popol (market leader) and Xyz's Abcvd."

If I do

from nltk.tag import StanfordNERTagger
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
print st.tag(qry.split())

I get

[
    (u'In', u'O'), (u'the', u'O'), (u'UK,', u'O'), (u'the', u'O'), 
    (u'class', u'O'), (u'is', u'O'), (u'relatively', u'O'), (u'crowded', u'O'), 
    (u'with', u'O'), (u'Zacc', u'PERSON'), (u'competing', u'O'), (u'with', u'O'), 
    (u"Abc's", u'O'), (u'Popol', u'O'), (u'(market', u'O'), (u'leader)', u'O'), 
    (u'and', u'O'), (u"Xyz's", u'O'), (u'Abcvd.', u'O')
]

So only one named entity is detected. However, if I do some cleanup by replacing all special characters with spaces

qry="In the UK the class is relatively crowded with Zacc competing with Abc s Popol market leader and Xyz s Abcvd"

I get

[
    (u'In', u'O'), (u'the', u'O'), (u'UK', u'LOCATION'), (u'the', u'O'), 
    (u'class', u'O'), (u'is', u'O'), (u'relatively', u'O'), (u'crowded', u'O'), 
    (u'with', u'O'), (u'Zacc', u'PERSON'), (u'competing', u'O'), (u'with', u'O'), 
    (u'Abc', u'ORGANIZATION'), (u's', u'O'), (u'Popol', u'PERSON'), (u'market', u'O'), 
    (u'leader', u'O'), (u'and', u'O'), (u'Xyz', u'ORGANIZATION'), (u's', u'O'), (u'Abcvd', u'PERSON')]

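So clearly this works better. Are there any general rules for how to clean up sentences for StanfordNER? Initially I thought no cleanup was needed at all!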

Answer 1

You can use the Stanford Tokenizer for this purpose, with the following code.

from nltk.tokenize.stanford import StanfordTokenizer
token = StanfordTokenizer('stanford-ner-2014-06-16/stanford-ner.jar')
qry="In the UK, the class is relatively crowded with Zacc competing with Abc's Popol (market leader) and  Xyz's Abcvd."
tok = token.tokenize(qry)
print tok

You will get the tokens as required.

[
    u'In', u'the', u'UK', u',', u'the', u'class', u'is', u'relatively',
    u'crowded', u'with', u'Zacc', u'competing', u'with', u'Abc', u"'s",
    u'Popol', u'-LRB-', u'market', u'leader', u'-RRB-', u'and', u'Xyz',
    u"'s", u'Abcvd', u'.'
]
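
The `-LRB-` and `-RRB-` tokens in that output are Penn Treebank style escapes for the round brackets. If you need the literal characters back (for display, say), a small mapping is enough. This is only a sketch: `PTB_BRACKETS` and `unescape_brackets` are names I made up for illustration, not part of NLTK or the Stanford tools, and whether you should unescape before tagging is something to check against your own results.

```python
# Penn Treebank escapes for brackets, as emitted by PTB-style tokenizers.
PTB_BRACKETS = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
}

def unescape_brackets(tokens):
    """Hypothetical helper: map PTB bracket escapes back to literal characters."""
    return [PTB_BRACKETS.get(tok, tok) for tok in tokens]

tokens = ["Popol", "-LRB-", "market", "leader", "-RRB-", "and"]
print(unescape_brackets(tokens))
```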

Answer 2

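You should make sure you are tokenizing the sentence. That is the big difference between your first call (where you implicitly tokenize, badly, with qry.split()) and your second call, where you have effectively tokenized by hand (e.g. the possessive 's becomes its own token). Stanford does have a tokenizer, which is the one the NER system was trained with, though I'm not an expert on how to call it from Python. Does simply not splitting the sentence tokenize it for you?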
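
To make the difference concrete, here is a rough stdlib-only comparison between `qry.split()` and a tokenizer that separates punctuation and the possessive `'s`. The regex and the `rough_tokenize` name are my own approximation for illustration, not the Stanford tokenizer; for real use, prefer the tokenizer the model was trained with.

```python
import re

# Hypothetical helper, not part of NLTK or Stanford NER: a rough regex
# approximation of Penn Treebank style tokenization.
# Order matters: try the possessive 's first, then words, then any
# single non-space, non-word character (punctuation, brackets).
TOKEN_RE = re.compile(r"'s\b|\w+|[^\w\s]")

def rough_tokenize(text):
    """Split off punctuation and the possessive 's as separate tokens."""
    return TOKEN_RE.findall(text)

qry = ("In the UK, the class is relatively crowded with Zacc competing "
       "with Abc's Popol (market leader) and Xyz's Abcvd.")

whitespace_tokens = qry.split()      # punctuation stays glued to words
regex_tokens = rough_tokenize(qry)   # punctuation becomes its own token

print(whitespace_tokens[2])   # 'UK,' (comma attached, so the tagger misses it)
print(regex_tokens[2:4])      # ['UK', ',']
```

With whitespace splitting the tagger sees `UK,`, `Abc's` and `(market` as single words, which is why those entities were missed in the first call.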

Answer 3

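Please word-tokenize your text before processing it. Also, note that most annotation systems are trained on sentences, so you may want to do sentence tokenization before word tokenization.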

alvas@ubi:~$ export STANFORDTOOLSDIR=$HOME
alvas@ubi:~$ export CLASSPATH=$STANFORDTOOLSDIR/stanford-ner-2015-12-09/stanford-ner.jar
alvas@ubi:~$ export STANFORD_MODELS=$STANFORDTOOLSDIR/stanford-ner-2015-12-09/classifiers
alvas@ubi:~$ python
Python 2.7.11 (default, Dec 15 2015, 16:46:19) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize
>>> from nltk.tag import StanfordNERTagger
>>> from nltk.internals import find_jars_within_path
>>> st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
>>> stanford_dir = st._stanford_jar.rpartition('/')[0]
>>> stanford_jars = find_jars_within_path(stanford_dir)
>>> st._stanford_jar = ':'.join(stanford_jars)
>>> 
>>> text = "In the UK, the class is relatively crowded with Zacc competing with Abc's Popol (market leader) and  Xyz's Abcvd."
>>> text = word_tokenize(text)
>>> text
['In', 'the', 'UK', ',', 'the', 'class', 'is', 'relatively', 'crowded', 'with', 'Zacc', 'competing', 'with', 'Abc', "'s", 'Popol', '(', 'market', 'leader', ')', 'and', 'Xyz', "'s", 'Abcvd', '.']
>>> st.tag(text)
[(u'In', u'O'), (u'the', u'O'), (u'UK', u'LOCATION'), (u',', u'O'), (u'the', u'O'), (u'class', u'O'), (u'is', u'O'), (u'relatively', u'O'), (u'crowded', u'O'), (u'with', u'O'), (u'Zacc', u'PERSON'), (u'competing', u'O'), (u'with', u'O'), (u'Abc', u'PERSON'), (u"'s", u'O'), (u'Popol', u'O'), (u'(', u'O'), (u'market', u'O'), (u'leader', u'O'), (u')', u'O'), (u'and', u'O'), (u'Xyz', u'ORGANIZATION'), (u"'s", u'O'), (u'Abcvd', u'O'), (u'.', u'O')]
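
Once you have tagged tokens like the list above, you will probably want the entities themselves rather than per-token tags. A minimal sketch, assuming you are happy to merge adjacent tokens that share the same non-`O` tag (`group_entities` is a hypothetical helper, not an NLTK function):

```python
from itertools import groupby

def group_entities(tagged):
    """Hypothetical helper: merge adjacent tokens sharing a non-'O' tag."""
    entities = []
    # groupby yields one group per run of consecutive pairs with the same tag.
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(tok for tok, _ in group), tag))
    return entities

# A shortened version of the tagger output, for illustration.
tagged = [('In', 'O'), ('the', 'O'), ('UK', 'LOCATION'), (',', 'O'),
          ('Zacc', 'PERSON'), ('competing', 'O'), ('with', 'O'),
          ('Abc', 'PERSON'), ("'s", 'O'), ('Xyz', 'ORGANIZATION')]
print(group_entities(tagged))
```

Note the limitation: these tags carry no begin/inside markers, so two different entities of the same type standing next to each other would be merged into one.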

How to clean sentences for StanfordNER