Question

I am using TokensRegex for rule-based entity extraction. It works well, but I cannot get the output in the format I need. For the sentence below, my code produces the following output:

Earlier this month, Trump targeted Toyota, threatening to impose a hefty fee on the world's largest automaker if it builds its Corolla cars for the U.S. market at a plant in Mexico.

Output:

MATCHED ENTITY: Donald Trump targeted Toyota, threatening to impose a hefty fee on the world's largest automaker if it builds its Corolla cars for the U.S. market  

VALUE: LIST([PERSON])

I know that if I iterate over the tokens using:

// Assumes imports of edu.stanford.nlp.ling.CoreLabel and edu.stanford.nlp.ling.CoreAnnotations.*
// cm is the CoreMap whose tokens are being inspected
for (CoreLabel token : cm.get(TokensAnnotation.class)) {
    String word = token.get(TextAnnotation.class);
    String lemma = token.get(LemmaAnnotation.class);
    String pos = token.get(PartOfSpeechAnnotation.class);
    String ne = token.get(NamedEntityTagAnnotation.class);
    System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", NE=" + ne);
}

I can get output that gives the annotation for each token. However, I use my own rules to detect named entities, and I sometimes see a problem with multi-token entities: one word inside the expression gets tagged as a person when the whole multi-token expression should be an organization (this mostly happens with organization and location names).

So the output I am expecting is:

MATCHED ENTITY: Donald Trump VALUE: PERSON
MATCHED ENTITY: Toyota VALUE: ORGANIZATION

How can I change my code to get the desired output? Do I need to use custom annotations?

Answer 1

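I built an up-to-date version of the jar about a week ago, so use the jar available from GitHub.
This sample code will run the rules and apply the appropriate ner tags.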

Answer 2

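I managed to get the output in the desired format.

// Imports this snippet relies on
import java.util.List;
import java.util.regex.Pattern;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
import edu.stanford.nlp.ling.tokensregex.Env;
import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
import edu.stanford.nlp.ling.tokensregex.NodePattern;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.util.CoreMap;

String text = "<Sentence to annotate>";
Annotation document = new Annotation(text);

// Use the pipeline (an already configured StanfordCoreNLP instance) to annotate the document we created
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

// Note: I don't put environment-related settings in the rule file, so they are set here.
Env env = TokenSequencePattern.getNewEnv();
env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor
      .createExtractorFromFiles(env, "test_degree.rules");

for (CoreMap sentence : sentences) {
    List<MatchedExpression> matched = extractor.extractExpressions(sentence);
    for (MatchedExpression phrase : matched) {
        // Print out matched text and value
        System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
    }
}

Output: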

MATCHED ENTITY: Technical Skill VALUE: SKILL

You may want to look at my rule file in this question.

Hope this helps!

Answer 3

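Answering my own question for anyone struggling with a similar problem. The key to getting the output in the right format lies in how the rules are defined in the rule file. Here is what I changed in the rule to alter the output:

Old rule: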

{    ruleType: "tokens",
     pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
     result: Annotate($1, ner, "LOCATION"),

}

New rule:

{    ruleType: "tokens",
     pattern: (([pos:/NNP.*/ | pos:/NN.*/]+) ($LocWords)),
     action: Annotate($1, ner, "LOCATION"),
     result: "LOCATION"

}

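How you define the result field determines the format of the output: the annotation now happens in the action field, and result holds the plain label that is reported as the match's value.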
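As a minimal sketch of the same idea for another entity type, a rule with a separate action and a plain-string result could look like the one below. The pattern, the company-suffix regex, and the `ner` binding line are assumptions for illustration only; they are not taken from my actual rule file.

```
# Assumed binding: lets Annotate() write to CoreNLP's NER tag on each token
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

# Hypothetical rule: one or more proper nouns followed by a company suffix
{    ruleType: "tokens",
     pattern: ( ([pos:/NNP.*/]+) /Inc\.?|Corp\.?|Ltd\.?/ ),
     # action tags every token of the whole match ($0)
     action: Annotate($0, ner, "ORGANIZATION"),
     # result is the value returned for the matched expression
     result: "ORGANIZATION"
}
```

With a rule shaped like this, the action still writes the tag onto the matched tokens, while the value of each matched expression should come back as the plain string from result rather than as a list.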

Hope this helps!

Getting output in the desired format using TokenRegex
