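I was finally able to get my TokensRegex code to produce some kind of output for named entities, but the output is not exactly what I want. I believe the rules need some tweaking.
Here is the code: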
    public static void main(String[] args)
    {
        String rulesFile = "D:\\Workspace\\resource\\NERRulesFile.rules.txt";
        String dataFile = "D:\\Workspace\\data\\GoldSetSentences.txt";
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
        props.setProperty("ner.useSUTime", "0");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));
        String inputText = "Bill Edelman, CEO and chairman of Paragonix Inc. announced that the company is expanding it's operations in China.";
        Annotation document = new Annotation(inputText);
        pipeline.annotate(document);
        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, rulesFile);
        /* Next we can go over the annotated sentences and extract the annotated words,
           Using the CoreLabel Object */
        for (CoreMap sentence : sentences)
        {
            List<MatchedExpression> matched = extractor.extractExpressions(sentence);
            for (MatchedExpression phrase : matched) {
                // Print out matched text and value
                System.out.println("matched: " + phrase.getText() + " with value: " + phrase.getValue());
                // Print out token information
                CoreMap cm = phrase.getAnnotation();
                for (CoreLabel token : cm.get(TokensAnnotation.class))
                {
                    if (token.tag().equals("NNP")) {
                        String leftContext = token.before();
                        String rightContext = token.after();
                        System.out.println(leftContext);
                        System.out.println(rightContext);
                        String word = token.get(TextAnnotation.class);
                        String lemma = token.get(LemmaAnnotation.class);
                        String pos = token.get(PartOfSpeechAnnotation.class);
                        String ne = token.get(NamedEntityTagAnnotation.class);
                        System.out.println("matched token: " + "word="+word + ", lemma="+lemma + ", pos=" + pos + "ne=" + ne);
                    }
                }
            }
        }
    }
}
Here is the rules file:
$TITLES_CORPORATE = (/chief/ /administrative/ /officer/|cao|ceo|/chief/ /executive/ /officer/|/chairman/|/vice/ /president/)
$ORGANIZATION_TITLES = (/International/|/inc\./|/corp/|/llc/)
# For detecting organization names like 'Paragonix Inc.'
{ ruleType: "tokens",
pattern: ([{pos: NNP}]+ $ORGANIZATION_TITLES),
action: ( Annotate($0, ner, "ORGANIZATION"),Annotate($1, ner, "ORGANIZATION") )
}
# For extracting organization names from a pattern - 'Genome International is planning to expand its operations in China.'
#(in the sentence given above the words planning and expand are part of the $OrgContextWords macros )
{
ruleType: "tokens",
pattern: (([{tag:/NNP.*/}]+) /,/*? /is|had|has|will|would/*? /has|had|have|will/*? /be|been|being/*? (?:[]{0,5}[{lemma:$OrgContextWords}]) /of|in|with|for|to|at|like|on/*?),
result: ( Annotate($1, ner, "ORGANIZATION") )
}
# For sentence like - Bill Edelman, Chairman and CEO of Paragonix Inc./ Zuckerberg CEO Facebook said today....
ENV.defaults["stage"] = 1
{
pattern: ( $TITLES_CORPORATE ),
action: ( Annotate($1, ner, "PERSON_TITLE"))
}
ENV.defaults["stage"] = 2
{
ruleType: "tokens",
pattern: ( ([ { pos:NNP} ]+) /,/*? (?:TITLES_CORPORATE)? /and|&/*? (?:TITLES_CORPORATE)? /,/*? /of|for/? /,/*? [ { pos:NNP } ]+ ),
result: (Annotate($1, ner, "PERSON"),Annotate($2, ner, "ORGANIZATION"))
}
The output I get is:
matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNPne=PERSON
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION
The output I expect is:
matched: Paragonix Inc. announced that the company is expanding with value: LIST([LIST([ORGANIZATION, ORGANIZATION])])
matched token: word=Paragonix, lemma=Paragonix, pos=NNPne=ORGANIZATION
matched token: word=Inc., lemma=Inc., pos=NNP, ne=ORGANIZATION
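Also, Bill Edelman is not identified as a person. The phrase containing Bill Edelman is not matched even though I have a rule in place for it. Do I need to stage my rules so that the entire phrase is matched against each rule and no entity is missed?
Answer 0 (score: 2)
I built a jar that represents the latest Stanford CoreNLP on the main GitHub page (as of April 14).
This command (with the latest code) should work for using the TokensRegexAnnotator (alternatively, if you use the Java API, the tokensregex settings can be passed in a Properties object):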
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules example.rules -tokensregex.caseInsensitive -file example.txt -outputFormat text
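For the Java API route mentioned above, the same settings can be collected in a `java.util.Properties` object. The sketch below only builds and prints that object so it runs without CoreNLP on the classpath. The property names are copied from the command-line flags. Mapping the bare `-tokensregex.caseInsensitive` flag to the value `"true"` is an assumption about how flags without a value are interpreted, and the class name `TokensRegexProps` is purely illustrative.

```java
import java.util.Properties;

class TokensRegexProps {
    // Mirrors the command-line flags; "example.rules" is the rules file
    // name used in that command.
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", "example.rules");
        // Assumption: a flag given without a value corresponds to "true".
        props.setProperty("tokensregex.caseInsensitive", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        // With CoreNLP available, these would be handed to the pipeline:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        props.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Putting `tokensregex` in the annotator list this way would replace the separate `pipeline.addAnnotator(...)` call used in the question.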
Here is a rules file I wrote that shows matching based on a sentence pattern:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
$ORGANIZATION_TITLES = "/inc\.|corp\./"
$COMPANY_INDICATOR_WORDS = "/company|corporation/"
{ pattern: (([{pos: NNP}]+ $ORGANIZATION_TITLES) /is/ /a/ $COMPANY_INDICATOR_WORDS), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
{ pattern: ($COMPANY_INDICATOR_WORDS /that/ ([{pos: NNP}]+) /works/ /for/), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
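Note that $0 means the whole pattern and $1 means the first capture group. So in this example, I put an extra pair of parentheses around the text I wanted to match.
I ran this on the example: Paragonix Inc. is a company that Joe Smith works for.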
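The $0 versus $1 distinction follows the same convention as ordinary regular expressions. As a rough analogy only, the sketch below uses `java.util.regex` on characters rather than TokensRegex on tokens. The character-level pattern and the class name `CaptureGroupDemo` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {
    // Returns {whole match, first capture group}, or null when nothing matches.
    static String[] groups(String text) {
        // The extra parentheses mark the part we actually want to keep.
        Pattern p = Pattern.compile("(\\w+ Inc\\.) is a company");
        Matcher m = p.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups("Paragonix Inc. is a company that Joe Smith works for.");
        System.out.println("group 0: " + g[0]); // the whole matched span
        System.out.println("group 1: " + g[1]); // only the parenthesized part
    }
}
```

Here group 0 covers "Paragonix Inc. is a company", while group 1 covers only "Paragonix Inc.", which is why annotating $1 rather than $0 labels just the organization name.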
This example shows how to use an extraction from the first stage in the second stage:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
$ORGANIZATION_TITLES = "/inc\.|corp\./"
$COMPANY_INDICATOR_WORDS = "/company|corporation/"
ENV.defaults["stage"] = 1
{ pattern: (/works/ /for/ ([{pos: NNP}]+ $ORGANIZATION_TITLES)), action: (Annotate($1, ner, "RULE_FOUND_ORG") ) }
ENV.defaults["stage"] = 2
{ pattern: (([{pos: NNP}]+) /works/ /for/ [{ner: "RULE_FOUND_ORG"}]), action: (Annotate($1, ner, "RULE_FOUND_PERSON") ) }
This example should work on the sentence: Joe Smith works for Paragonix Inc.
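To make the staging idea concrete, here is a toy model in plain Java. It is not TokensRegex and does not call CoreNLP. It assumes pre-split tokens, uses capitalization as a crude stand-in for the NNP tag, and hard-codes the two rules above. The class and method names are hypothetical. The point is only that stage 2 reads the tags written by stage 1.

```java
import java.util.Arrays;

class StagedRulesDemo {
    // Crude stand-in for the NNP part-of-speech tag (an assumption of this sketch).
    static boolean isCapitalized(String s) {
        return !s.isEmpty() && Character.isUpperCase(s.charAt(0));
    }

    // Index of "works" in the bigram "works for", or -1 if absent.
    static int findWorksFor(String[] tokens) {
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals("works") && tokens[i + 1].equals("for")) {
                return i;
            }
        }
        return -1;
    }

    static String[] tag(String[] tokens) {
        String[] tags = new String[tokens.length];
        Arrays.fill(tags, "O");
        int w = findWorksFor(tokens);
        if (w < 0) {
            return tags;
        }
        // Stage 1: after "works for", one or more capitalized tokens
        // followed by a capitalized organization suffix (Inc. or Corp.).
        int end = w + 2;
        while (end < tokens.length && isCapitalized(tokens[end])) {
            end++;
        }
        if (end > w + 3 && tokens[end - 1].matches("(?i)(inc|corp)\\.")) {
            for (int i = w + 2; i < end; i++) {
                tags[i] = "RULE_FOUND_ORG";
            }
        }
        // Stage 2: fires only if stage 1 tagged the token after "works for".
        if (w + 2 < tokens.length && tags[w + 2].equals("RULE_FOUND_ORG")) {
            for (int i = w - 1; i >= 0 && isCapitalized(tokens[i]); i--) {
                tags[i] = "RULE_FOUND_PERSON";
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        String[] tokens = { "Joe", "Smith", "works", "for", "Paragonix", "Inc." };
        System.out.println(Arrays.toString(tag(tokens)));
    }
}
```

If the organization suffix is dropped from the input, stage 1 tags nothing and stage 2 therefore never fires, which mirrors how a later-stage rule depends on annotations from an earlier stage.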