Question

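I am working with Stanford CoreNLP and using it for NER. But when I extract organization names, I see that each word is tagged with its own annotation. So if the entity is "NEW YORK TIMES", it is recorded as three separate entities: "NEW", "YORK", and "TIMES". Is there a property we can set in Stanford CoreNLP so that we get the combined output as a single entity?

In Stanford NER, the command-line utility lets us choose inlineXML as the output format. Can we somehow set a property to select the output format in Stanford CoreNLP as well?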

Answer 1

If you just want the complete string of each named entity found by Stanford NER, try this:

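import java.util.List;
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.util.CoreMap;
import edu.stanford.nlp.util.Triple;
String text = "<INSERT YOUR INPUT TEXT HERE>";
AbstractSequenceClassifier<CoreMap> ner = CRFClassifier.getDefaultClassifier();
List<Triple<String, Integer, Integer>> entities = ner.classifyToCharacterOffsets(text);
for (Triple<String, Integer, Integer> entity : entities)
    System.out.println(text.substring(entity.second, entity.third)); // full entity string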

In case you are wondering, the entity class is given by entity.first.

Alternatively, you can use ner.classifyWithInlineXML(text) to get output like <PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .
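
If you go the inline XML route, you still need to pull the entities back out of the tagged string. A minimal sketch of one way to do that with a regular expression follows. The class name InlineXmlEntities and the method extract are made up for illustration, and the pattern assumes labels are uppercase and that the entity text contains no nested tags or XML escapes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Stanford NER.
public class InlineXmlEntities {
    // Matches <LABEL>text</LABEL>; the backreference \1 forces the closing
    // tag to carry the same label as the opening tag.
    private static final Pattern ENTITY = Pattern.compile("<([A-Z_]+)>(.*?)</\\1>");

    /** Returns one "LABEL: text" string per tagged span, in document order. */
    public static List<String> extract(String inlineXml) {
        List<String> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(m.group(1) + ": " + m.group(2));
        }
        return entities;
    }

    public static void main(String[] args) {
        String tagged = "<PERSON>Bill Smith</PERSON> went to <LOCATION>Paris</LOCATION> .";
        for (String entity : extract(tagged)) {
            System.out.println(entity);
        }
    }
}
```

If you need offsets into the original text rather than just the strings, classifyToCharacterOffsets is probably the better starting point, since the inserted tags shift every position in the inline XML string.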

Answer 2

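No, CoreNLP 3.5.0 has no utility for merging NER tags. The next release (sometime next week) will include a new MentionsAnnotator that handles this merging. For now, you can either (a) use the MentionsAnnotator available on the CoreNLP master branch, or (b) merge the tags manually.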
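
For option (b), here is a rough sketch of what manual merging could look like. It assumes you have already pulled the words and their NER tags out of the pipeline into two parallel lists, and that "O" marks tokens outside any entity. The class NerMerger and the method merge are illustrative names, not CoreNLP API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
public class NerMerger {
    /** Joins runs of consecutive tokens that share the same non-"O" tag. */
    public static List<String> merge(List<String> words, List<String> tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals(currentTag)) {
                // The tag changed: close the entity that was open, if any.
                if (!currentTag.equals("O")) {
                    entities.add(currentTag + ": " + current);
                }
                current.setLength(0);
                currentTag = tag;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words.get(i));
            }
        }
        // Flush an entity that runs to the end of the input.
        if (!currentTag.equals("O")) {
            entities.add(currentTag + ": " + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("I", "read", "the", "New", "York", "Times", "in", "Paris");
        List<String> tags = Arrays.asList("O", "O", "O", "ORGANIZATION", "ORGANIZATION", "ORGANIZATION", "O", "LOCATION");
        for (String entity : merge(words, tags)) {
            System.out.println(entity);
        }
    }
}
```

One limitation of this approach: two different entities of the same type that sit directly next to each other get glued into one, because the plain per-token tags carry no boundary information.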

Use the -outputFormat xml option to have CoreNLP output XML. (Is this what you want?)

Answer 3

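You can set any property in a properties file, including the "outputFormat" property. Stanford CoreNLP supports several formats, such as json, xml, and text. However, the xml option is not the inlineXML format. The xml format gives per-token annotations for NER: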

    <tokens> 
      <token id="1"> 
        <word>New</word> 
        <lemma>New</lemma> 
        <CharacterOffsetBegin>0</CharacterOffsetBegin> 
        <CharacterOffsetEnd>3</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
      <token id="2"> 
        <word>York</word> 
        <lemma>York</lemma> 
        <CharacterOffsetBegin>4</CharacterOffsetBegin> 
        <CharacterOffsetEnd>8</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
      <token id="3"> 
        <word>Times</word> 
        <lemma>Times</lemma> 
        <CharacterOffsetBegin>9</CharacterOffsetBegin> 
        <CharacterOffsetEnd>14</CharacterOffsetEnd> 
        <POS>NNP</POS> 
        <NER>ORGANIZATION</NER> 
        <Speaker>PER0</Speaker> 
      </token> 
    </tokens>
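
If you are stuck with this per-token XML, you can rebuild the full entity strings yourself from the character offsets. The following is only a sketch: it assumes the element names shown above (token, NER, CharacterOffsetBegin, CharacterOffsetEnd), assumes "O" marks non-entity tokens, and needs the original input text to slice from. XmlTokenMerger and merge are made-up names, not CoreNLP API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of CoreNLP.
public class XmlTokenMerger {
    // Text content of the first child element with the given name.
    private static String child(Element token, String name) {
        return token.getElementsByTagName(name).item(0).getTextContent().trim();
    }

    /** Returns "TAG: text" for each run of adjacent tokens with the same non-"O" NER tag. */
    public static List<String> merge(String xml, String text) {
        NodeList tokens;
        try {
            tokens = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("token");
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse token XML", e);
        }
        List<String> entities = new ArrayList<>();
        String tag = "O";
        int begin = 0;
        int end = 0;
        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            String ner = child(token, "NER");
            if (!ner.equals(tag)) {
                // The tag changed: emit the span that just ended, then start a new one.
                if (!tag.equals("O")) {
                    entities.add(tag + ": " + text.substring(begin, end));
                }
                tag = ner;
                begin = Integer.parseInt(child(token, "CharacterOffsetBegin"));
            }
            end = Integer.parseInt(child(token, "CharacterOffsetEnd"));
        }
        if (!tag.equals("O")) {
            entities.add(tag + ": " + text.substring(begin, end));
        }
        return entities;
    }
}
```

Calling merge with the token XML above and the text "New York Times" should give a single ORGANIZATION entry covering offsets 0 to 14. Slicing the original text, rather than joining the word elements, keeps the original spacing and punctuation intact.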

Answer 4

From Stanford CoreNLP 3.6 onward, you can add entitymentions to the pipeline and get a list of all entities. Here is an example that works for me.

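import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations.MentionsAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// entitymentions must come after ner (and regexner) so it can group their token-level tags
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner, entitymentions");
props.setProperty("regexner.mapping", "jg-regexner.txt");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

String inputText = "I have done Bachelor of Arts and Bachelor of Laws so that I can work at British Broadcasting Corporation";
Annotation annotation = new Annotation(inputText);

pipeline.annotate(annotation);

// Each CoreMap here is one complete, possibly multi-word, entity mention
List<CoreMap> multiWordsExp = annotation.get(MentionsAnnotation.class);
for (CoreMap multiWord : multiWordsExp) {
    String custNERClass = multiWord.get(NamedEntityTagAnnotation.class);
    System.out.println(multiWord + " : " + custNERClass);
}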

Formatting NER output from Stanford CoreNLP
