StanfordNLP - ArrayIndexOutOfBoundsException at TokensRegexNERAnnotator.readEntries (TokensRegexNERAnnotator.java:696)

Asked: 2017-04-29 04:52:03

Tags: java nlp stanford-nlp

I want to use StanfordNLP's TokensRegexNERAnnotator to recognize the following as SKILL.

AREAS OF EXPERTISE Areas of Knowledge Computer Skills Technical Experience Technical Skills

There are many more word sequences like the ones above.

Code -

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.pipeline.TokensRegexNERAnnotator;
    import edu.stanford.nlp.util.CoreMap;

    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.addAnnotator(new TokensRegexNERAnnotator("./mapping/test_degree.rule", true));
    String[] tests = {"Bachelor of Arts is a good degree.", "Technical Skill is a must have for Software Developer."};
    List<CoreLabel> tokens = new ArrayList<>();

    // traversing each test string
    for (String txt : tests) {
        System.out.println("String is : " + txt);

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(txt);

        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        /* Next we can go over the annotated sentences and extract the annotated words,
           using the CoreLabel object */
        for (CoreMap sentence : sentences) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                System.out.println("annotated coreMap sentences : " + token);
                // Extracting the NER tag for the current token
                String ne = token.get(NamedEntityTagAnnotation.class);
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                System.out.println("Current Word : " + word + " POS :" + token.get(PartOfSpeechAnnotation.class));
                System.out.println("Lemma : " + token.get(LemmaAnnotation.class));
                System.out.println("Named Entity : " + ne);
            }
        }
    }

My regex rule file is -

    $SKILL_FIRST_KEYWORD = "/areas/ | ..."
    $SKILL_KEYWORD = "/knowledge/ | /skill/ | /skills/ | /expertise/ | /experience/"

    tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

    {
      ruleType: "tokens",
      pattern: ( $SKILL_FIRST_KEYWORD + $SKILL_KEYWORD ),
      result: "SKILL"
    }

I am getting an ArrayIndexOutOfBoundsException. I'm guessing something is wrong with my rule file. Can someone point out where I went wrong?

Desired output -

AREAS OF EXPERTISE - SKILL

Areas of Knowledge - SKILL

Computer Skills - SKILL

And so on.

Thanks in advance.

2 Answers:

Answer 0 (score: 1):

You should use TokensRegexAnnotator instead of TokensRegexNERAnnotator.
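
The likely reason for the exception: TokensRegexNERAnnotator.readEntries parses its mapping file as tab-separated columns (roughly pattern, NER tag, and optionally overwritable tags and a priority), so handing it a TokensRegex rules file with macro definitions does not produce the columns it expects. Purely as an illustration (these entries are not from the original post), a mapping file in the format TokensRegexNERAnnotator expects would look something like this, with columns separated by tabs (shown here as wide gaps):

    Bachelor of Arts        DEGREE
    Technical Skills        SKILL    MISC    1.0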

You should look at these threads for more information:

TokensRegex rules to get correct output for Named Entities

Getting output in the desired format using TokenRegex
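
If you want to keep the pipeline setup from the question, a minimal sketch of that swap, assuming TokensRegexAnnotator's constructor that takes rule file paths, could look like this:

    import java.util.Properties;

    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.pipeline.TokensRegexAnnotator;

    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // TokensRegexAnnotator understands the rules-file syntax
    // (macros, ruleType: "tokens", result: "SKILL", ...),
    // unlike TokensRegexNERAnnotator's tab-separated mapping format.
    pipeline.addAnnotator(new TokensRegexAnnotator("./mapping/test_degree.rule"));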

Answer 1 (score: 0):

The accepted answer above from @StanfordNLPHelp helped me solve this problem. All credit goes to them.

I am just summarizing how the final code gets the output in the desired format, in the hope that it helps someone.

First I changed the rule file to

    $SKILL_FIRST_KEYWORD = "/area of|areas of|Technical|computer|professional/"
    $SKILL_KEYWORD = "/knowledge|skill|skills|expertise|experience/"
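
The answer above only shows the two macro definitions; a sketch of what the complete test_degree.rules file could look like, assuming the tokens binding and the SKILL rule are carried over from the question's rule file (with the two macros simply placed one after the other in the pattern), is:

    tokens = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$TokensAnnotation" }

    $SKILL_FIRST_KEYWORD = "/area of|areas of|Technical|computer|professional/"
    $SKILL_KEYWORD = "/knowledge|skill|skills|expertise|experience/"

    {
      ruleType: "tokens",
      pattern: ( $SKILL_FIRST_KEYWORD $SKILL_KEYWORD ),
      result: "SKILL"
    }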

Then in the code

    import java.util.List;
    import java.util.Properties;
    import java.util.regex.Pattern;

    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
    import edu.stanford.nlp.ling.tokensregex.Env;
    import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
    import edu.stanford.nlp.ling.tokensregex.NodePattern;
    import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    for (String txt : tests) {
        System.out.println("String is : " + txt);

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(txt);

        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        // Environment for the rules file, with case-insensitive string matching
        Env env = TokenSequencePattern.getNewEnv();
        env.setDefaultStringMatchFlags(NodePattern.CASE_INSENSITIVE);
        env.setDefaultStringPatternFlags(Pattern.CASE_INSENSITIVE);

        // Extractor that applies the TokensRegex rules to each sentence
        CoreMapExpressionExtractor extractor =
                CoreMapExpressionExtractor.createExtractorFromFiles(env, "test_degree.rules");
        for (CoreMap sentence : sentences) {
            List<MatchedExpression> matched = extractor.extractExpressions(sentence);
            for (MatchedExpression phrase : matched) {
                // Print out matched text and value
                System.out.println("MATCHED ENTITY: " + phrase.getText() + " VALUE: " + phrase.getValue().get());
            }
        }
    }