I want to use Stanford NLP's TokensRegexNERAnnotator to recognize the following as SKILL.
AREAS OF EXPERTISE
Areas of Knowledge
Computer Skills
Technical Experience
Technical Skills
There are more text sequences like the ones above.
Code -
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true));
String[] tests = {"Bachelor of Arts is a good degree.", "Technical Skill is a must have for Software Developer."};
List<CoreLabel> tokens = new ArrayList<>();
// traversing each sentence from array of sentence.
for (String txt : tests) {
    System.out.println("String is : " + txt);
    // create an empty Annotation just with the given text
    Annotation document = new Annotation(txt);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    /* Next we can go over the annotated sentences and extract the annotated words,
       using the CoreLabel object */
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            System.out.println("annotated coreMap sentences : " + token);
            // Extracting NER tag for current token
            String ne = token.get(NamedEntityTagAnnotation.class);
            String word = token.get(CoreAnnotations.TextAnnotation.class);
            System.out.println("Current Word : " + word + " POS :" + token.get(PartOfSpeechAnnotation.class));
            System.out.println("Lemma : " + token.get(LemmaAnnotation.class));
            System.out.println("Named Entity : " + ne);
        }
    }
}
My regex rules file is -
$SKILL_FIRST_KEYWORD = "/area/|/areas/"
$SKILL_KEYWORD = "/knowledge/|/skill/|/skills/|/expertise/|/experience/"
tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }
{ ruleType: "tokens", pattern: ($SKILL_FIRST_KEYWORD + $SKILL_KEYWORD), result: "SKILL" }
I am getting an ArrayIndexOutOfBoundsException error. I guess there is a problem with my rules file. Can someone point out where I went wrong?
Desired output -
AREAS OF EXPERTISE - SKILL
Areas of Knowledge - SKILL
Computer Skills - SKILL
and so on.
Thanks in advance.
Answer 0 (score: 1)
You should use TokensRegexAnnotator instead of TokensRegexNERAnnotator.
You should look at these threads for more information:
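For context, the two annotators take different rule files, which is likely why the file above fails: TokensRegexNERAnnotator expects a tab-separated mapping file (pattern, NER tag, and optionally overwritable tags and a priority), not the SequenceMatchRules syntax shown in the question. A minimal sketch of that mapping format (the exact patterns here are illustrative, not from the question):

```
( /areas?/ /of/ (/expertise/|/knowledge/) )	SKILL	MISC	1.0
( (/technical/|/computer/) (/skills?/|/experience/) )	SKILL	MISC	1.0
```

Each field is separated by a literal tab character; TokensRegex-style token patterns in parentheses are supported by recent CoreNLP versions.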
Answer 1 (score: 0)
The accepted answer above from @StanfordNLPHelp helped me solve this problem. All credit goes to them.
I am just summarizing how the final code produces output in the desired format, in the hope that it helps someone.
First, I changed the rules file to
$SKILL_FIRST_KEYWORD = "/area of|areas of|Technical|computer|professional/"
$SKILL_KEYWORD = "/knowledge|skill|skills|expertise|experience/"
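To sanity-check what these two alternations are meant to match, here is a plain java.util.regex approximation (my own sketch, not part of the original answer; TokensRegex matches token-by-token, so this string-level regex is only an approximation):

```java
import java.util.regex.Pattern;

public class SkillPatternSketch {
    // String-level approximation of $SKILL_FIRST_KEYWORD followed by $SKILL_KEYWORD
    static final Pattern SKILL = Pattern.compile(
        "(?:areas? of|technical|computer|professional)\\s+"
        + "(?:knowledge|skills?|expertise|experience)",
        Pattern.CASE_INSENSITIVE);

    // Returns the tag the TokensRegex rule would assign, or "O" for no match
    static String tag(String text) {
        return SKILL.matcher(text).find() ? "SKILL" : "O";
    }

    public static void main(String[] args) {
        String[] headings = {"AREAS OF EXPERTISE", "Areas of Knowledge",
                             "Computer Skills", "Technical Experience"};
        for (String h : headings) {
            System.out.println(h + " -> " + tag(h));
        }
    }
}
```

All four headings from the question print as `SKILL` with these alternations, which matches the desired output.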
Then, in the code:
props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
for (String txt : tests) {
    System.out.println("String is : " + txt);
    // create an empty Annotation just with the given text
    Annotation document = new Annotation(txt);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    Env env = TokenSequencePattern.getNewEnv();
    env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
    env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);
    CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(env, "test_degree.rules");
    for (CoreMap sentence : sentences) {
        List<MatchedExpression> matched = extractor.extractExpressions(sentence);
        for (MatchedExpression phrase : matched) {
            // Print out matched text and value
            System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
        }
    }
}