CoreNLP: extracting the spans of tokens

Time: 2013-12-14 22:24:32

Tags: java annotations nlp stanford-nlp

I want to extract the character spans of the tokens in a String of text. Using Stanford's CoreNLP, I have:

Properties props;
props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma");
this.pipeline = new StanfordCoreNLP(props);

String answerText = "This is the answer";
ArrayList<IntPair> tokenSpans = new ArrayList<IntPair>();
// create an empty Annotation with just the given text
Annotation document = new Annotation(answerText);
// run all Annotators on this text
this.pipeline.annotate(document);

// Iterate over all of the sentences
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for(CoreMap sentence: sentences) {
    // Iterate over all tokens in a sentence
    for (CoreLabel fullToken: sentence.get(TokensAnnotation.class)) {
        IntPair span = fullToken.get(SpanAnnotation.class);
        tokenSpans.add(span);
    }
}

However, all of the IntPairs are null. Do I need to add another annotator to this line?

props.put("annotators", "tokenize, ssplit, pos, lemma");

Expected output:

(0,3), (5,6), (8,10), (12,17)

1 Answer:

Answer 0: (score: 2)

The problem is the use of SpanAnnotation, which applies to Trees. The correct classes for this query are CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation.

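For example, they can be used like this:

int begin = fullToken.get(CharacterOffsetBeginAnnotation.class);
int end = fullToken.get(CharacterOffsetEndAnnotation.class);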

...forgive my indentation
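
A note on the offsets themselves: CoreNLP's CharacterOffsetEndAnnotation is the index of the first character after the token, so "This" comes back as begin 0, end 4, while the expected output in the question uses inclusive ends such as (0,3). The sketch below is a plain-Java illustration of that convention, not CoreNLP code. It assumes a simple whitespace tokenizer, and the class name TokenSpanSketch and its two helper methods are made up for this example. With CoreNLP, the begin and end values would come from the two annotations on each CoreLabel instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; it does not use CoreNLP.
class TokenSpanSketch {

    /** Returns {begin, endExclusive} for each whitespace-delimited token. */
    static List<int[]> exclusiveSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            // m.start() is the index of the token's first character;
            // m.end() is the index just past its last character, which is
            // the same half-open convention as CharacterOffsetEndAnnotation.
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }

    /** Formats the spans with inclusive end offsets, e.g. "(0,3), (5,6)". */
    static String formatInclusive(List<int[]> spans) {
        StringBuilder sb = new StringBuilder();
        for (int[] s : spans) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            // Subtract 1 to turn the exclusive end into an inclusive one.
            sb.append('(').append(s[0]).append(',').append(s[1] - 1).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> spans = exclusiveSpans("This is the answer");
        System.out.println(formatInclusive(spans));
    }
}
```

Subtracting 1 from each end offset is what turns the half-open offsets into the (0,3), (5,6), (8,10), (12,17) form asked for in the question.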