Question

I'm using the RegexNER annotator in CoreNLP, and some of my named entities consist of multiple words. An excerpt from my mapping file:

RAF inhibitor	DRUG_CLASS

Gilbert's syndrome	DISEASE

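The first one is detected, but each word gets the DRUG_CLASS annotation separately, and there seems to be no way to link the words, such as an NER ID shared by both words.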

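The second one is not detected at all, probably because the tokenizer treats the apostrophe after Gilbert as a separate token. Since RegexNER has tokenization as a dependency, I can't really work around it.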

Any suggestions for resolving these cases?

Answer 1

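If you use the entitymentions annotator, it will create entity mentions out of consecutive tokens that have the same NER tag. The one downside is that if two entities of the same type are side by side, they will be joined together. We are working on improving the NER system, so we may include a new model that finds the boundaries of distinct mentions in these cases; hopefully this will make it into Stanford CoreNLP 3.8.0.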

Here is some sample code for accessing the entity mentions:

package edu.stanford.nlp.examples;

import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class EntityMentionsExample {

  public static void main(String[] args) {
    // Text to annotate; entity mentions are read back after the pipeline runs.
    Annotation document =
        new Annotation("John Smith visited Los Angeles on Tuesday.");
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);

    for (CoreMap entityMention : document.get(CoreAnnotations.MentionsAnnotation.class)) {
      System.out.println(entityMention);
      System.out.println(entityMention.get(CoreAnnotations.TextAnnotation.class));
    }
  }
}
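To make the grouping behavior described above concrete, here is a minimal, library-free sketch of the idea: consecutive tokens with the same non-"O" tag are merged into one mention. This is only an illustration under simplifying assumptions, not CoreNLP's actual implementation; the class name `MentionGroupingSketch` and the method `groupMentions` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class MentionGroupingSketch {

  /** Merges runs of consecutive tokens that share the same non-"O" tag into "TAG: text" strings. */
  static List<String> groupMentions(String[] words, String[] tags) {
    List<String> mentions = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    String currentTag = "O";
    for (int i = 0; i < words.length; i++) {
      String tag = tags[i];
      if (!tag.equals(currentTag)) {
        // The tag changed, so close the mention that was open (if any).
        if (!currentTag.equals("O")) {
          mentions.add(currentTag + ": " + current);
        }
        current.setLength(0);
        currentTag = tag;
      }
      if (!tag.equals("O")) {
        if (current.length() > 0) {
          current.append(' ');
        }
        current.append(words[i]);
      }
    }
    // Close a mention that runs to the end of the input.
    if (!currentTag.equals("O")) {
      mentions.add(currentTag + ": " + current);
    }
    return mentions;
  }

  public static void main(String[] args) {
    // Two tokens tagged DRUG_CLASS become a single mention.
    System.out.println(groupMentions(
        new String[] {"He", "takes", "a", "RAF", "inhibitor", "."},
        new String[] {"O", "O", "O", "DRUG_CLASS", "DRUG_CLASS", "O"}));

    // The downside: two adjacent entities of the same type are joined into one mention.
    System.out.println(groupMentions(
        new String[] {"RAF", "inhibitor", "MEK", "inhibitor"},
        new String[] {"DRUG_CLASS", "DRUG_CLASS", "DRUG_CLASS", "DRUG_CLASS"}));
  }
}
```

The second call shows why side-by-side entities of the same type cannot be told apart by tag alone: nothing in the tag sequence marks where one mention ends and the next begins.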

If you tokenize the rule the same way the tokenizer does, it will work fine, so for example the rule should be Gilbert 's syndrome.

So you could run the tokenizer over all of your text patterns, and this problem will go away.
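One way to do that is sketched below, assuming CoreNLP is on the classpath and that the patterns are plain text rather than regular expressions (regex metacharacters could be split or altered by the tokenizer). It uses CoreNLP's `PTBTokenizer` with default options; the class name `PatternTokenizerSketch` and the helper `tokenizePattern` are hypothetical, and a pipeline configured with non-default tokenizer options would need the same options here.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class PatternTokenizerSketch {

  /** Tokenizes a plain-text pattern and rejoins the tokens with single spaces. */
  static String tokenizePattern(String pattern) {
    // Default tokenizer options (empty string); an assumption that should match the pipeline.
    PTBTokenizer<CoreLabel> tokenizer =
        new PTBTokenizer<>(new StringReader(pattern), new CoreLabelTokenFactory(), "");
    List<String> words = new ArrayList<>();
    for (CoreLabel token : tokenizer.tokenize()) {
      words.add(token.word());
    }
    return String.join(" ", words);
  }

  public static void main(String[] args) {
    // Emit a tab-separated mapping line whose pattern matches the tokenizer's output.
    System.out.println(tokenizePattern("Gilbert's syndrome") + "\t" + "DISEASE");
  }
}
```

Running each pattern of the mapping file through a helper like this before writing the file keeps the rules aligned with how the text itself is tokenized.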

How can I detect named entities of more than one word using CoreNLP's RegexNER?
