Blank results when recognizing named entities with TokensRegex rules

Time: 2017-04-07 12:23:59

Tags: macros stanford-nlp

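I am struggling to write a correct rule, involving a macro, for recognizing organizations in text.

For example, to recognize Matrix Inc. in:

As share prices rose, Matrix Inc. has emerged as a winner this quarter.

I am trying to check for words like Inc within the entity, so I defined a macro and rules as follows: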

$ORGANIZATION_TITLES = "/pharmaceuticals?|group|corp|corporation|international|co.?|inc.?|incorporated|holdings|motors|ventures|parters|llc|limited liability corporation|pvt.? ltd.?/"

ENV.defaults["stage"] = 1
 {
  ruleType: "tokens",
  pattern: ([$ORGANIZATION_TITLES]), 
  action:  ( Annotate($0, ner, "ORGANIZATION") )
}

 ENV.defaults["stage"] = 2
 { ( [{tag:NNP}]+? ($ORGANIZATION_TITLES)) => ORGANIZATION }

I also tried binding the macro in code and then applying the rule.

env.bind("$ORGANIZATION_TITLES", TokenSequencePattern.compile(env,"/pharmaceuticals?|group|corp|corporation|international|co.?|inc.?|incorporated|holdings|motors|ventures|parters|llc|limited liability corporation|pvt.? ltd.?/"));

Nothing seems to work. I also need to define more complex pattern rules, such as:

pattern:  ( [ { ner:PERSON } ]+ /,/*? ($TITLES_CORPORATE_PREFIXES)*? $TITLES_CORPORATE+? /,/*? /of|for/? /,/*? [ { ner:ORGANIZATION } ]+ )

where $TITLES_CORPORATE_PREFIXES and $TITLES_CORPORATE are macros similar to $ORGANIZATION_TITLES.

What am I doing wrong?

EDIT

Here is my code:

public static void main(String[] args)
    {
        String  rulesFile = "D:\\Workspace\\resource\\NERRulesFile.txt";
        String dataFile = "D:\\Workspace\\resource\\GoldSetSentences.txt";

        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));
        String inputText = "Bill Edelman , CEO and Chairman , for Paragonix commented on the Supply Agreement with Essential Pharmaceuticals .";


        Annotation document = new Annotation(inputText.toLowerCase());
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), rulesFile);
        /* Next we can go over the annotated sentences and extract the annotated words,
         Using the CoreLabel Object */
        for (CoreMap sentence : sentences)
        {

            List<MatchedExpression> matched = extractor.extractExpressions(sentence);

            for(MatchedExpression phrase : matched){

                // Print out matched text and value
                System.out.println("matched: " + phrase.getText() + " with value " + phrase.getValue());
                // Print out token information
                CoreMap cm = phrase.getAnnotation();
                for (CoreLabel token : cm.get(TokensAnnotation.class))
                {

                    String word = token.get(TextAnnotation.class);
                    String lemma = token.get(LemmaAnnotation.class);
                    String pos = token.get(PartOfSpeechAnnotation.class);
                    String ne = token.get(NamedEntityTagAnnotation.class);
                    System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", ne=" + ne);
                }
            }
        }

    }

1 answer:

Answer 0: (score: 0)

Here is a rules file that should work:

ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

$ORGANIZATION_TITLES = "/inc\.|corp\./"

{ pattern: ([{pos: NNP}]+ $ORGANIZATION_TITLES), action: ( Annotate($0, ner, "RULE_FOUND_ORG") ) }
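One detail in this rules file is worth spelling out: the dots in $ORGANIZATION_TITLES are escaped. An unescaped `.` is a regex wildcard, so an alternative like `inc.?` from the question also accepts tokens such as `inch`, and an alternative containing a space, like `pvt.? ltd.?`, cannot match any single token. The sketch below illustrates this with plain `java.util.regex`, not TokensRegex itself. It assumes a string regex is tested against the whole text of one token; the class `TitleRegexDemo` and its helper are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

public class TitleRegexDemo {
    // Unescaped dots, as in part of the macro from the question.
    public static final Pattern LOOSE = Pattern.compile("co.?|inc.?|pvt.? ltd.?");
    // Escaped dots, as in the rules file from the answer.
    public static final Pattern STRICT = Pattern.compile("inc\\.|corp\\.");

    // Assumption: a token matches only if the regex covers its entire text.
    public static boolean matchesToken(Pattern p, String token) {
        return p.matcher(token).matches();
    }

    public static void main(String[] args) {
        // "pvt." and "ltd." arrive as separate tokens, so the
        // space-containing alternative never gets a chance to match.
        for (String token : new String[] {"inc.", "inch", "cod", "pvt.", "ltd."}) {
            System.out.println(token
                    + " loose=" + matchesToken(LOOSE, token)
                    + " strict=" + matchesToken(STRICT, token));
        }
    }
}
```

Under that assumption, multi-word titles would need to be written as a sequence of token patterns rather than as one regex with embedded spaces.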

I made some changes to the codebase to make TokensRegexAnnotator easier to access. You will need to get the latest version from GitHub: https://github.com/stanfordnlp/CoreNLP

java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules organization.rules -file samples.txt -outputFormat text -tokensregex.caseInsensitive

Running the command above, or the equivalent Java API calls, should work.
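A note on the -tokensregex.caseInsensitive flag: the code in the question lowercases the input before annotating it, which removes the capitalization a POS tagger typically relies on to assign NNP, so a pattern built on NNP tokens is unlikely to match. Matching case-insensitively lets the input keep its original casing. The sketch below uses plain `java.util.regex` and assumes the flag behaves like `Pattern.CASE_INSENSITIVE` applied to the token regex; that is an assumption about TokensRegex, and `CaseFlagDemo` is a hypothetical name.

```java
import java.util.regex.Pattern;

public class CaseFlagDemo {
    // Same alternatives as $ORGANIZATION_TITLES in the rules file.
    public static final String TITLES = "inc\\.|corp\\.";

    // Assumption: the case-insensitive option acts like these regex flags.
    public static boolean matchesToken(String token, boolean caseInsensitive) {
        int flags = caseInsensitive
                ? (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                : 0;
        return Pattern.compile(TITLES, flags).matcher(token).matches();
    }

    public static void main(String[] args) {
        // The original casing "Inc." only matches with the flag enabled.
        System.out.println("case-sensitive:   " + matchesToken("Inc.", false));
        System.out.println("case-insensitive: " + matchesToken("Inc.", true));
    }
}
```

With this approach the text can be passed to the pipeline unchanged, so the tagger still sees "Matrix Inc." rather than "matrix inc.".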