Why does the NamedEntityAnnotator output for date mentions differ from the CoreNLP demo's output?

Asked: 2016-04-25 06:25:40

Tags: nlp stanford-nlp named-entity-recognition

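The date detected by my program below is split into two separate mentions, whereas in the NER output of the CoreNLP demo the date is detected as a single unit, as it should be. What should I edit in my program to correct this?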

import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Build a pipeline that runs NER and then groups tokens into entity mentions.
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, entitymentions");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String text = "This software was released on Februrary 5, 2015.";
Annotation document = new Annotation(text);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

// Print the text, NER tag, and normalized value of every entity mention.
for (CoreMap sentence : sentences) {
    List<CoreMap> mentions = sentence.get(MentionsAnnotation.class);
    if (mentions != null) {
        for (CoreMap mention : mentions) {
            System.out.println("== Token=" + mention.get(TextAnnotation.class));
            System.out.println("NER=" + mention.get(NamedEntityTagAnnotation.class));
            System.out.println("Normalized NER=" + mention.get(NormalizedNamedEntityTagAnnotation.class));
        }
    }
}

Output of this program:

== Token=Februrary 5,
NER=DATE
Normalized NER=****0205
== Token=2015
NER=DATE
Normalized NER=2015  

Output of the CoreNLP online demo: [image]

1 answer:

Answer 0 (score: 2)

Note that the online demo is showing any sequence of consecutive tokens with the same NER tag as belonging to the same unit. Consider this sentence:

The event happened on February 5th January 9th.

This example yields "February 5th January 9th" as a single DATE in the online demo.

Yet the entitymentions annotator recognizes "February 5th" and "January 9th" as separate entity mentions.

Your sample code is looking at mentions, not NER chunks. Mentions are not being shown by the online demo.
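
To make the difference concrete, here is a minimal sketch of the kind of grouping described above: merging consecutive tokens that carry the same NER tag into one chunk. It is not CoreNLP code. `NerChunks` and `group` are hypothetical names, and the sketch assumes you have already copied each token's text and tag into two parallel lists (for example from `CoreLabel.word()` and `CoreLabel.ner()`). The tags in `main` are illustrative, not captured pipeline output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP: merges runs of consecutive
// tokens that share the same non-"O" NER tag into a single chunk.
final class NerChunks {

    static List<String> group(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            // A change of tag closes the chunk that was being built.
            if (!tag.equals(currentTag)) {
                flush(chunks, current, currentTag);
                currentTag = tag;
            }
            // Tokens tagged "O" are outside any entity and are skipped.
            if (!"O".equals(tag)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        flush(chunks, current, currentTag);
        return chunks;
    }

    // Emits the pending chunk, if any, as "TAG: text" and resets the buffer.
    private static void flush(List<String> chunks, StringBuilder current, String tag) {
        if (current.length() > 0) {
            chunks.add(tag + ": " + current);
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Assumed tags for the example sentence; real tags come from the ner annotator.
        List<String> words = Arrays.asList(
                "The", "event", "happened", "on", "February", "5th", "January", "9th", ".");
        List<String> tags = Arrays.asList(
                "O", "O", "O", "O", "DATE", "DATE", "DATE", "DATE", "O");
        System.out.println(group(words, tags));
    }
}
```

Grouping purely by tag cannot tell where one date ends and the next begins, which is why four adjacent DATE tokens collapse into one chunk here while the mentions stay separate.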

That being said, I am not sure why SUTime is not joining February 5th and 2015 together in your example. Thanks for bringing this up; I will look into improving the module to fix this issue in a future release.