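I am using the simple rules file given below to detect named entities in text. An example sentence for this rule is:
Bill Gates, President and Chairman of Microsoft.
Here the first run of NNP POS tags refers to the PERSON Bill Gates, and the second run of NNP POS tags refers to the ORGANIZATION Microsoft.
I am getting an empty output.
I am not sure how to capture the PERSON and ORGANIZATION entities. What changes should I make to my rules file so that it captures both groups, or at least one of them, such as ORGANIZATION?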
$TITLES_CORPORATE = "/chief administrative officer|cao|chief marketing officer|cmo|chief operating officer|coo|chief privacy officer|cpo|chief process officer|chief product officer|chief reputation officer|cro|chief research officer|chief restructuring officer|chief risk officer|chief science officer|cso|chief scientific Officer|chief security officer|chief services officer|chief strategy officer|chief sustainability officer|chief technology officer|vice chairman|general manager|gm|manager/";
$TITLE_PREFIXES = "/senior|executive|assistant|deputy|chief|general|staff/";
{
  ruleType: "tokens",
  pattern: ( [ { pos:NNP } ]+ ($TITLE_PREFIXES)? TITLES_CORPORATE /,/? /of/? [ { pos:NNP } ]+ ),
  result: "ORGANIZATION"
}
Here is my code:
public static void main(String[] args)
{
    String rulesFile = "D:\\Workspace\\resource\\NERRulesFile.txt";
    String dataFile = "D:\\Workspace\\resource\\GoldSetSentences.txt";
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation document = new Annotation(dataFile);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
    //List<CoreLabel> tokens = new ArrayList<CoreLabel>();
    CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), rulesFile);
    for (CoreMap sentence : sentences) {
        List<MatchedExpression> matched = extractor.extractExpressions(sentence);
        System.out.println(matched);
    }
}
Answer 0 (score: 0)
Here are the tokens for your example:
[Bill, Gates, President, and, Chairman, of, Microsoft, Corp, .]
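TokensRegex rules operate over TOKENS, so each regular expression has to match a single token. That means your rule will not work at all as written, because its title regexes (for example "chief administrative officer") are multi-token expressions, and no single token can ever match them.
Here is a pattern that matches "President and Chairman of Microsoft Corp" in the token list above: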
pattern: (/President/ /and/? /Chairman/ /of/? [{pos: NNP}]+)
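The per-token point can be illustrated without CoreNLP. The following is only a rough sketch using plain java.util.regex, not TokensRegex itself; the class name TokenMatchDemo and the helper anyTokenMatches are hypothetical names made up for this illustration, and it assumes a string regex must match the whole text of one token. Under that assumption, an alternative such as "vice chairman" can never match, because no single token contains a space.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class: mimics "one regex must match one whole token".
public class TokenMatchDemo {

    // Returns true if at least one single token fully matches the regex.
    public static boolean anyTokenMatches(List<String> tokens, String regex) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        // matches() requires the entire token text to match, not a substring.
        return tokens.stream().anyMatch(t -> p.matcher(t).matches());
    }

    public static void main(String[] args) {
        // The token list from the example sentence above.
        List<String> tokens = Arrays.asList(
                "Bill", "Gates", "President", "and", "Chairman",
                "of", "Microsoft", "Corp", ".");

        // Multi-word alternatives: no single token can match them.
        System.out.println(anyTokenMatches(tokens, "vice chairman|general manager"));

        // Single-word alternatives: each one is tested against one token.
        System.out.println(anyTokenMatches(tokens, "president|chairman"));
    }
}
```

The first call prints false and the second prints true. In a rules file, the equivalent fix is to spell a multi-word title as a sequence of per-token patterns, in the same style as the pattern above.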