Question

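I finally got my TokensRegex code to produce some output for named entities, but the output is not what I want. I believe the rules need some tweaking.

Here is the code: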

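    import java.util.List;
    import java.util.Properties;
    import java.util.regex.Pattern;

    import edu.stanford.nlp.ling.CoreAnnotations.*;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
    import edu.stanford.nlp.ling.tokensregex.Env;
    import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
    import edu.stanford.nlp.ling.tokensregex.NodePattern;
    import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.pipeline.TokensRegexAnnotator;
    import edu.stanford.nlp.util.CoreMap;

    // The class declaration was missing from the post; the name here is a placeholder.
    public class TokensRegexDemo {

    public static void main(String[] args)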
    {
        String  rulesFile = "D:\\Workspace\\resource\\NERRulesFile.rules.txt";
        String dataFile = "D:\\Workspace\\data\\GoldSetSentences.txt";

        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
        props.setProperty("ner.useSUTime", "0");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));
        String inputText = "Bill Edelman, CEO and chairman of Paragonix Inc. announced that the company is expanding it's operations in China.";

        Annotation document = new Annotation(inputText);
        pipeline.annotate(document);
        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE); 
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, rulesFile);

        /* Next we can go over the annotated sentences and extract the annotated words,
         Using the CoreLabel Object */
        for (CoreMap sentence : sentences)
        {

            List<MatchedExpression> matched = extractor.extractExpressions(sentence);

            for(MatchedExpression phrase : matched){

                // Print out matched text and value
                System.out.println("matched: " + phrase.getText() + " with value: " + phrase.getValue());
                // Print out token information
                CoreMap cm = phrase.getAnnotation();
                for (CoreLabel token : cm.get(TokensAnnotation.class))
                {
                    if (token.tag().equals("NNP")){
                        String leftContext = token.before();
                        String rightContext = token.after();
                        System.out.println(leftContext);
                        System.out.println(rightContext);


                        String word = token.get(TextAnnotation.class);
                        String lemma = token.get(LemmaAnnotation.class);
                        String pos = token.get(PartOfSpeechAnnotation.class);
                        String ne = token.get(NamedEntityTagAnnotation.class);
                        System.out.println("matched token: " + "word="+word + ", lemma="+lemma + ", pos=" + pos + "ne=" + ne);
                    }

                }
            }
        }
    }
    }

Here is the rules file:

$TITLES_CORPORATE  = (/chief/ /administrative/ /officer/|cao|ceo|/chief/ /executive/ /officer/|/chairman/|/vice/ /president/)
$ORGANIZATION_TITLES = (/International/|/inc\./|/corp/|/llc/)

# For detecting organization names like 'Paragonix Inc.' 

{    ruleType: "tokens",
     pattern: ([{pos: NNP}]+ $ORGANIZATION_TITLES),
     action: ( Annotate($0, ner, "ORGANIZATION"),Annotate($1, ner, "ORGANIZATION") ) 
}

# For extracting organization names from a pattern - 'Genome International is planning to expand its operations in China.' 
#(in the sentence given above the words planning and expand are part of the $OrgContextWords macros )
{
  ruleType: "tokens",
  pattern: (([{tag:/NNP.*/}]+) /,/*? /is|had|has|will|would/*? /has|had|have|will/*? /be|been|being/*? (?:[]{0,5}[{lemma:$OrgContextWords}]) /of|in|with|for|to|at|like|on/*?),
  result:  ( Annotate($1, ner, "ORGANIZATION") ) 
}

# For sentence like - Bill Edelman, Chairman and CEO of Paragonix Inc./ Zuckerberg CEO Facebook said today....  

ENV.defaults["stage"] = 1
{
  pattern: ( $TITLES_CORPORATE ), 
  action: ( Annotate($1, ner, "PERSON_TITLE")) 
}

ENV.defaults["stage"] = 2 
{
  ruleType: "tokens",
  pattern:  ( ([ { pos:NNP} ]+) /,/*? (?:TITLES_CORPORATE)? /and|&/*? (?:TITLES_CORPORATE)? /,/*? /of|for/? /,/*? [ { pos:NNP } ]+ ),
  result: (Annotate($1, ner, "PERSON"),Annotate($2, ner, "ORGANIZATION"))
}

The output I get is:

matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNPne=PERSON
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION

The output I expect is:

matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNPne=ORGANIZATION
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION

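Also, Bill Edelman is not recognized as a person. The phrase containing Bill Edelman is not matched at all, even though I have written a rule for it. Do I need to write each rule so that it matches the entire phrase, so that no entity is missed?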

Answer 1

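I built a jar from the latest Stanford CoreNLP code on the main GitHub page (as of April 14).

With that latest code, the following command should run the TokensRegexAnnotator (alternatively, if you are using the Java API, you can pass the tokensregex settings in the Properties object):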

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules example.rules -tokensregex.caseInsensitive -file example.txt -outputFormat text
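
For the Java API route, the same settings can be expressed as Properties entries. The following is only a sketch: the class and method names are made up for illustration, and it assumes that the bare `-tokensregex.caseInsensitive` flag corresponds to the value "true". The pipeline construction is left as a comment so the snippet runs without CoreNLP on the classpath.

```java
import java.util.Properties;

public class TokensRegexProps {
    // Mirrors the command-line flags above as Properties entries.
    static Properties build(String rulesPath) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", rulesPath);
        // Assumption: a flag given without a value on the command line means "true".
        props.setProperty("tokensregex.caseInsensitive", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("example.rules");
        // With CoreNLP on the classpath, the next step would be:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        props.list(System.out);
    }
}
```

Configuring the annotator this way means the pipeline runs the rules itself, so a separate `addAnnotator` call or a hand-built `CoreMapExpressionExtractor` should not be needed.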

Here is a rules file I wrote that shows matching based on a sentence pattern:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

$ORGANIZATION_TITLES = "/inc\.|corp\./"

$COMPANY_INDICATOR_WORDS = "/company|corporation/"

{ pattern: (([{pos: NNP}]+ $ORGANIZATION_TITLES) /is/ /a/ $COMPANY_INDICATOR_WORDS), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }

{ pattern: ($COMPANY_INDICATOR_WORDS /that/ ([{pos: NNP}]+) /works/ /for/), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }

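Note that $0 refers to the entire pattern and $1 refers to the first capture group. So in this example I put an extra pair of parentheses around the part of the text I wanted to annotate.

I ran these rules on the example: Paragonix Inc. is a company that Joe Smith works for.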
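
The $0 versus $1 distinction works much like group 0 versus group 1 in ordinary Java regular expressions. The snippet below is an analogy using `java.util.regex` on plain strings, not TokensRegex itself; capitalized words stand in for NNP tokens, and the class name is invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupAnalogy {
    // String-level stand-in for the first rule: the outer parentheses
    // wrap only the organization name, not the trailing context words.
    static final Pattern RULE = Pattern.compile(
        "((?:[A-Z][a-z]+ )+(?:Inc\\.|Corp\\.)) is a (?:company|corporation)");

    // Returns {whole match, first capture group}, or an empty array if no match.
    static String[] groups(String sentence) {
        Matcher m = RULE.matcher(sentence);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups("Paragonix Inc. is a company that Joe Smith works for.");
        System.out.println("$0 -> " + g[0]); // Paragonix Inc. is a company
        System.out.println("$1 -> " + g[1]); // Paragonix Inc.
    }
}
```

Annotating $0 would tag the context words ("is a company") as well, which is why the rules annotate $1.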

The following example shows how an extraction made in the first stage can be used in the second stage:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

$ORGANIZATION_TITLES = "/inc\.|corp\./"

$COMPANY_INDICATOR_WORDS = "/company|corporation/"

ENV.defaults["stage"] = 1

{ pattern: (/works/ /for/ ([{pos: NNP}]+ $ORGANIZATION_TITLES)), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }

ENV.defaults["stage"] = 2

{ pattern: (([{pos: NNP}]+) /works/ /for/ [{ner: "RULE_FOUND_ORG"}]), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }

The staged rules above should work on the sentence: Joe Smith works for Paragonix Inc.
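
To make the staging idea concrete, here is a toy model in plain Java. It is not the TokensRegex API: it treats capitalized words as a stand-in for NNP tokens, and every name in it is hypothetical. It only shows that the second pass can match on tags written by the first pass.

```java
import java.util.Arrays;
import java.util.List;

public class StagedTaggingSketch {
    static final List<String> ORG_SUFFIXES = Arrays.asList("inc.", "corp.");

    static boolean isCapitalized(String w) {
        return !w.isEmpty() && Character.isUpperCase(w.charAt(0));
    }

    static boolean isOrgSuffix(String w) {
        return ORG_SUFFIXES.contains(w.toLowerCase());
    }

    // Returns one tag per token; "O" means no tag was assigned.
    static String[] tag(String[] tokens) {
        String[] tags = new String[tokens.length];
        Arrays.fill(tags, "O");

        // Stage 1: "works for <Capitalized>+ <suffix>" -> RULE_FOUND_ORG
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals("works") && tokens[i + 1].equals("for")) {
                int j = i + 2;
                while (j < tokens.length && isCapitalized(tokens[j]) && !isOrgSuffix(tokens[j])) {
                    j++;
                }
                if (j > i + 2 && j < tokens.length && isOrgSuffix(tokens[j])) {
                    for (int k = i + 2; k <= j; k++) {
                        tags[k] = "RULE_FOUND_ORG";
                    }
                }
            }
        }

        // Stage 2: "<Capitalized>+ works for <RULE_FOUND_ORG>" -> RULE_FOUND_PERSON
        for (int i = 1; i + 2 < tokens.length; i++) {
            if (tokens[i].equals("works") && tokens[i + 1].equals("for")
                    && tags[i + 2].equals("RULE_FOUND_ORG")) {
                for (int k = i - 1; k >= 0 && isCapitalized(tokens[k]); k--) {
                    tags[k] = "RULE_FOUND_PERSON";
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String[] tokens = { "Joe", "Smith", "works", "for", "Paragonix", "Inc." };
        System.out.println(Arrays.toString(tag(tokens)));
    }
}
```

If the two passes were swapped, the person rule would find no RULE_FOUND_ORG tag to anchor on, which is the reason for the explicit stage numbers in the rules file.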
