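A surprising tag from the Stanford POS Tagger

Time: 2016-02-12 12:35:11

Tags: java nlp stanford-nlp pos-tagger

I am using the Stanford POS Tagger on the following text (from a Times of India news report about the player auction for the Indian Premier League):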

  

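Royal Challengers Bangalore are used to making strong statements at the Indian Premier League auctions and they did so again on Saturday (February 6) with the marquee signing of seasoned Australian all-rounder Shane Watson. The staggering Rs 9.5 crore that the team paid for the 34-year-old made him the costliest buy this year.

The Vijay Mallya-owned side fought off stiff competition from new entrants Rising Pune Supergiants and defending champions Mumbai Indians to snare the former Rajasthan Royals star. Watson, a battling right-handed batsman and handy medium-pacer, will add serious bite to the Virat Kohli-led Bengaluru side still chasing their maiden title.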

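For the last sentence of the second paragraph, the Stanford POS Tagger tags the first word, 'Watson', as a base-form verb (VB)! I searched Chambers' Twentieth Century Dictionary to see whether 'watson' is a verb, but I could not find any such entry.

I got the following output from a function I run in my code: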

  

Watson,VB aDT battlingVBG rightJJ handedNN batsmanNN andCC handyJJ mediumNN pacer,NN willMD addVB seriousJJ biteNN toTO theDT ViratNNP KohliNNP ledVBD BengaluruNNP sideNN stillRB chasing theirPRP$ maidenJJ title.NN

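3 answers:

Answer 0 (score: 4)

The problem seems to be that you did not tokenize the text before POS tagging it. In your output the punctuation is still attached to the words ('Watson,', 'pacer,', 'title.'), so the tagger never sees the token 'Watson' on its own.

As @ChristopherManning shows, if you tokenize the text before tagging, the output of the Stanford POS tagger is correct.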
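To see why the missing tokenization matters, here is a minimal sketch in plain Python. It does not use the Stanford tools at all: the regular expression is my own stand-in for a real tokenizer such as PTBTokenizer and covers far fewer cases (it would split a decimal like 9.5, for example). It only illustrates that splitting on whitespace leaves punctuation glued to the words, so a tagger is handed the string 'Watson,' instead of 'Watson' followed by ','.

```python
import re

SENTENCE = ("Watson, a battling right-handed batsman and handy "
            "medium-pacer, will add serious bite")

def whitespace_split(text):
    """Naive splitting: punctuation stays attached to the neighbouring word."""
    return text.split()

# Illustrative pattern only (an assumption, not PTBTokenizer's actual rules):
# a word with optional internal hyphens, or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Separate punctuation from words while keeping hyphenated words whole."""
    return TOKEN_RE.findall(text)

if __name__ == "__main__":
    print(whitespace_split(SENTENCE)[:2])  # ['Watson,', 'a']
    print(simple_tokenize(SENTENCE)[:3])   # ['Watson', ',', 'a']
```

The string 'Watson,' is very unlikely to be a word the model knows, so the tagger has to guess from context and word shape, which is one plausible way to end up with an odd tag such as VB.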

Using CoreNLP on the command line:

alvas@ubi:~/stanford-corenlp-full-2015-12-09$ echo """Royal Challengers Bangalore are used to making strong statements at the Indian Premier League auctions and they did so again on Saturday (February 6) with the marquee signing of seasoned Australian all-rounder Shane Watson. The staggering Rs 9.5 crore that the team paid for the 34-year-old made him the costliest buy this year.

The Vijay Mallya-owned side fought off stiff competition from new entrants Rising Pune Supergiants and defending champions Mumbai Indians to snare the former Rajasthan Royals star. Watson, a battling right-handed batsman and handy medium-pacer, will add serious bite to the Virat Kohli-led Bengaluru side still chasing their maiden title.""" > watson.txt
alvas@ubi:~/stanford-corenlp-full-2015-12-09$ 
alvas@ubi:~/stanford-corenlp-full-2015-12-09$ java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat json -file watson.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.6 sec].

Processing file /home/alvas/stanford-corenlp-full-2015-12-09/watson.txt ... writing to /home/alvas/stanford-corenlp-full-2015-12-09/watson.txt.json
Annotating file /home/alvas/stanford-corenlp-full-2015-12-09/watson.txt
done.
Annotation pipeline timing information:
TokenizerAnnotator: 0.1 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.1 sec.
TOTAL: 0.1 sec. for 110 tokens at 791.4 tokens/sec.
Pipeline setup: 1.6 sec.
Total time for StanfordCoreNLP pipeline: 1.9 sec

The output is saved in watson.txt.json. With a little post-processing in Python:

>>> import json
>>> with open('watson.txt.json') as fin:
...     output = json.load(fin)
... 
>>> for sent in output['sentences']:
...     print(' '.join(tok['word'] + '/' + tok['pos'] for tok in sent['tokens']) + '\n')
... 

Royal/NNP Challengers/NNS Bangalore/NNP are/VBP used/VBN to/TO making/VBG strong/JJ statements/NNS at/IN the/DT Indian/JJ Premier/NNP League/NNP auctions/NNS and/CC they/PRP did/VBD so/RB again/RB on/IN Saturday/NNP -LRB-/-LRB- February/NNP 6/CD -RRB-/-RRB- with/IN the/DT marquee/JJ signing/NN of/IN seasoned/JJ Australian/JJ all-rounder/NN Shane/NNP Watson/NNP ./.

The/DT staggering/JJ Rs/NN 9.5/CD crore/VBP that/IN the/DT team/NN paid/VBN for/IN the/DT 34-year-old/JJ made/VBD him/PRP the/DT costliest/JJS buy/VB this/DT year/NN ./.

The/DT Vijay/NNP Mallya-owned/JJ side/NN fought/VBD off/RP stiff/JJ competition/NN from/IN new/JJ entrants/NNS Rising/VBG Pune/NNP Supergiants/NNPS and/CC defending/VBG champions/NNS Mumbai/NNP Indians/NNPS to/TO snare/VB the/DT former/JJ Rajasthan/NNP Royals/NNPS star/NN ./.

Watson/NNP ,/, a/DT battling/VBG right-handed/JJ batsman/NN and/CC handy/JJ medium-pacer/NN ,/, will/MD add/VB serious/JJ bite/NN to/TO the/DT Virat/NNP Kohli-led/NNP Bengaluru/NNP side/NN still/RB chasing/VBG their/PRP$ maiden/JJ title/NN ./.
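If you only want to check how one particular word was tagged, you can filter that JSON instead of printing everything. The following is a small sketch under assumptions: the keys 'sentences', 'tokens', 'word' and 'pos' are the ones used in the snippet above, the sample dictionary is hand-written to mimic that layout (it is not real CoreNLP output), and tags_for is a hypothetical helper name.

```python
def tags_for(output, word):
    """Collect the POS tag of every occurrence of `word`, in document order."""
    return [tok["pos"]
            for sent in output["sentences"]
            for tok in sent["tokens"]
            if tok["word"] == word]

# Hand-written sample that imitates the layout of watson.txt.json (assumption).
sample = {
    "sentences": [
        {"tokens": [{"word": "Shane", "pos": "NNP"},
                    {"word": "Watson", "pos": "NNP"},
                    {"word": ".", "pos": "."}]},
        {"tokens": [{"word": "Watson", "pos": "NNP"},
                    {"word": ",", "pos": ","}]},
    ]
}

if __name__ == "__main__":
    print(tags_for(sample, "Watson"))  # ['NNP', 'NNP']
```

With the real file you would pass the result of json.load(fin) in place of the sample dictionary.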

Note that if you use Stanford CoreNLP on the command line, it will NOT let you POS tag without tokenizing first:

alvas@ubi:~/stanford-corenlp-full-2015-12-09$ java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators pos -outputFormat json -file watson.txt
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.4 sec].
Exception in thread "main" java.lang.IllegalArgumentException: annotator "pos" requires annotator "tokenize"
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:375)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:139)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:135)
    at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1214)

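Whether you use the Stanford POS tagger through the GUI, the command line, a Python API, or directly as a library imported into your Java code, it is advisable to split your text into sentences and tokenize each sentence before POS tagging it.

The Stanford CoreNLP API documentation gives an example of how to annotate data in Java: http://stanfordnlp.github.io/CoreNLP/api.html

Answer 1 (score: 1)

I cannot reproduce this tagging with the current version (3.6.0), using either the default model or the slower, better model. (In general, though, the tagger is not restricted to a tag dictionary and can choose whichever tag it thinks fits best.)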

$ java -cp "*" edu.stanford.nlp.tagger.maxent.MaxentTagger -model edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger -textFile watson.txt 

Loading default properties from tagger edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger

Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].

Royal_NNP Challengers_NNS Bangalore_NNP are_VBP used_VBN to_TO making_VBG strong_JJ statements_NNS at_IN the_DT Indian_JJ Premier_NNP League_NNP auctions_NNS and_CC they_PRP did_VBD so_RB again_RB on_IN Saturday_NNP -LRB-_-LRB- February_NNP 6_CD -RRB-_-RRB- with_IN the_DT marquee_JJ signing_NN of_IN seasoned_JJ Australian_JJ all-rounder_NN Shane_NNP Watson_NNP ._.
The_DT staggering_JJ Rs_NN 9.5_CD crore_VBP that_IN the_DT team_NN paid_VBN for_IN the_DT 34-year-old_JJ made_VBD him_PRP the_DT costliest_JJS buy_VB this_DT year_NN ._.
The_DT Vijay_NNP Mallya-owned_JJ side_NN fought_VBD off_RP stiff_JJ competition_NN from_IN new_JJ entrants_NNS Rising_VBG Pune_NNP Supergiants_NNPS and_CC defending_VBG champions_NNS Mumbai_NNP Indians_NNPS to_TO snare_VB the_DT former_JJ Rajasthan_NNP Royals_NNPS star_NN ._.
Watson_NNP ,_, a_DT battling_VBG right-handed_JJ batsman_NN and_CC handy_JJ medium-pacer_NN ,_, will_MD add_VB serious_JJ bite_NN to_TO the_DT Virat_NNP Kohli-led_NNP Bengaluru_NNP side_NN still_RB chasing_VBG their_PRP$ maiden_JJ title_NN ._.

Tagged 110 words at 859.38 words per second.
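If you want to compare runs programmatically rather than by eye, the word_TAG lines above are easy to split back into pairs. This is a minimal sketch; parse_tagged is a hypothetical helper, not part of the tagger distribution. It splits on the last separator so that tokens such as -LRB-_-LRB- or ._. come out correctly.

```python
def parse_tagged(line, sep="_"):
    """Turn 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        word, found, tag = item.rpartition(sep)
        if not found:
            # No separator in this item: keep the text and leave the tag unknown.
            word, tag = item, None
        pairs.append((word, tag))
    return pairs

if __name__ == "__main__":
    line = "Watson_NNP ,_, a_DT battling_VBG -LRB-_-LRB- their_PRP$ ._."
    for word, tag in parse_tagged(line):
        print(word, tag)
```

Passing sep="/" handles the word/TAG form printed from the JSON output in the first answer.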

Answer 2 (score: 0)

I could not reproduce this behavior using the Maven package. Make sure you are using the correct tagger. I used the default one:

Royal_NNP Challengers_NNS Bangalore_NNP are_VBP used_VBN to_TO making_VBG strong_JJ statements_NNS at_IN the_DT Indian_JJ Premier_NNP League_NNP auctions_NNS and_CC they_PRP did_VBD so_RB again_RB on_IN Saturday_NNP -LRB-_-LRB- February_NNP 6_CD -RRB-_-RRB- with_IN the_DT marquee_JJ signing_NN of_IN seasoned_JJ Australian_JJ all-rounder_NN Shane_NNP Watson_NNP ._. The_DT staggering_JJ Rs_NN 9.5_CD crore_VBP that_IN the_DT team_NN paid_VBN for_IN the_DT 34-year-old_JJ made_VBD him_PRP the_DT costliest_JJS buy_VB this_DT year.The_NNP Vijay_NNP Mallya-owned_JJ side_NN fought_VBD off_RP stiff_JJ competition_NN from_IN new_JJ entrants_NNS Rising_VBG Pune_NNP Supergiants_NNPS and_CC defending_VBG champions_NNS Mumbai_NNP Indians_NNPS to_TO snare_VB the_DT former_JJ Rajasthan_NNP Royals_NNPS star_NN ._. Watson_NNP ,_, a_DT battling_VBG right-handed_JJ batsman_NN and_CC handy_JJ medium-pacer_NN ,_, will_MD add_VB serious_JJ bite_NN to_TO the_DT Virat_NNP Kohli-led_NNP Bengaluru_NNP side_NN still_RB chasing_VBG their_PRP$ maiden_JJ title_NN ._.
