Can I stop Stanford POS and NER taggers from removing "#" and "@" characters?

Asked: 2017-07-31 12:01:26

Tags: stanford-nlp

I'm doing some processing with the Stanford NLP software. First of all, thanks to everyone at Stanford for all this great stuff! Here's my conundrum:

I have sentences that can contain URLs ("http://"), email addresses ("@"), and hashtags ("#"). I'm using python to do the work. If I use the POS and NER tagging built into nltk, all these special characters are kept in their tokenized words, but it's very slow, since each call fires up a new java instance. So I've run the taggers in server mode instead, and when I pass the full sentences through, they come back with all those special characters stripped off. I'm using the python sner package to interface with the servers.

Here's what I mean. To use the nltk StanfordPOSTagger, you have to pass in a pre-tokenized sentence. I'm using the StanfordTokenizer.

>>> from subprocess import Popen   # needed to launch the tagger server
>>> from nltk.tag.stanford import StanfordPOSTagger
>>> from nltk.tokenize import StanfordTokenizer
>>> import sner  # https://pypi.python.org/pypi/sner

>>> sent = "Here's an #example from me@y.ou url http://me.you"
>>> st = StanfordTokenizer(homedir+'models/stanford-postagger.jar',
                           options={"ptb3Ellipsis": False})
>>> nltk_pos = StanfordPOSTagger(homedir+'models/english-bidirectional-distsim.tagger',
                                 homedir+'models/stanford-postagger.jar')
>>> # launch the POS tagger in server mode, listening on port 2020
>>> pos_args = ['java', '-mx300m', '-cp', homedir+'models/stanford-postagger.jar',
                'edu.stanford.nlp.tagger.maxent.MaxentTaggerServer', '-model',
                homedir+'models/english-bidirectional-distsim.tagger', '-port', '2020']
>>> POS = Popen(pos_args)
>>> sp = sner.Ner(host="localhost", port=2020)

>>> nltk_pos.tag(st.tokenize(sent))
[(u'Here', u'RB'), (u"'s", u'VBZ'), (u'an', u'DT'),
 (u'#example', u'NN'), (u'from', u'IN'), (u'me@y.ou', u'NN'),
 (u'url', u'NN'), (u'http://me.you', u'NN')]

>>> sp.tag(sent)
[(u'Here', u'RB'), (u"'s", u'VBZ'), (u'an', u'DT'),
 (u'example', u'NN'), (u'from', u'IN'), (u'y.ou', u'NN'),
 (u'url', u'NN'), (u'//me.you', u'NN')]

I'm curious why there's a difference, and whether there's a way to get the servers not to strip out those characters. I've read that there are ways to pass flags to the POS server so that it uses pre-tokenized text ("-tokenize false"), but I can't figure out how to pass that list of strings to the server through the python interface. In the sner package, the text to be parsed is sent as a single string, not as a list of strings like a tokenizer returns (see the sketch below).
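The obvious workaround, pre-tokenizing locally and joining the tokens back into a single string before sending, presumably doesn't help either, since the server just re-tokenizes that string with its own default tokenizer. A sketch of what I mean:

>>> # send one space-joined string of the locally tokenized sentence
>>> sp.tag(' '.join(st.tokenize(sent)))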

-b

1 Answer:

Answer 0 (score: 0)

The problem is that nltk uses edu.stanford.nlp.process.WhitespaceTokenizer as its tokenizerFactory (so your pre-tokenized input is only split on whitespace), while the server re-tokenizes the raw string with its default tokenizer.

You can change the NER server arguments like this:

java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerOptions tokenizeNLs=false
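
With the server started this way, pre-tokenized text sent as a single space-joined string passes through intact, because the server now splits on whitespace only. A minimal client-side sketch, reusing the question's st, sent, and sner setup (the port matches the command above):

>>> tokens = st.tokenize(sent)       # keeps '#example', 'me@y.ou', 'http://me.you'
>>> sp = sner.Ner(host="localhost", port=9199)
>>> sp.tag(' '.join(tokens))         # tokens come back with '#' and '@' preserved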