How can I extract an unlabeled/untyped dependency tree from a TreeAnnotation using Stanford CoreNLP?

Asked: 2015-06-15 14:07:19

Tags: java stanford-nlp

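The target language is Spanish.

The English pipeline supports typed dependencies, whereas, as far as I know, the Spanish pipeline does not.

The goal is to produce a dependency tree from the TreeAnnotation, where the final result is a list of directed edges. Is this possible with CoreNLP 3.4.1 and the Spanish models, and if so, how?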

Background

I am using Stanford CoreNLP 3.4.1 (plus the 3.5.0 Spanish models for POS tagging; Java 8 is not in use yet for compatibility reasons) with the following configuration:

Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, ner, parse");
props.setProperty("tokenize.options", "invertible=true,ptb3Escaping=true");
props.setProperty("tokenize.language", "es");

props.setProperty("pos.model", "edu/stanford/nlp/models/pos-tagger/spanish/spanish-distsim.tagger");
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz");

props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/spanishSR.ser.gz"); //Stanford Parser 3.4.1 shift-reduce models for Spanish. 

props.setProperty("ner.applyNumericClassifiers", "false");
props.setProperty("ner.useSUTime", "false");

These properties are then used to create the pipeline and annotate a document.

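// Imports needed by the configuration above and the loop below:
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

// `document` is the annotation object built from the input text (construction not shown).
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);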

List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

for(CoreMap sentence: sentences) {

    // ... extract start, end position of sentence ...

    for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class)) {

        // ... extract POS tags, NER annotations, id ...
    }

    //This works, and I have a tree that is not empty.
    Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
}

Using the debugger, I inspected the sentences and tokens and found that they carry the following annotations:

Sentence (keys)

From edu.stanford.nlp.ling.CoreAnnotations:

  • TextAnnotation
  • CharacterOffsetBeginAnnotation
  • CharacterOffsetEndAnnotation
  • TokensAnnotation
  • TokenBeginAnnotation
  • TokenEndAnnotation
  • SentenceIndexAnnotation

From edu.stanford.nlp.trees.TreeCoreAnnotations:

  • TreeAnnotation

Token (keys)

From edu.stanford.nlp.ling.CoreAnnotations:

  • TextAnnotation
  • OriginalTextAnnotation
  • CharacterOffsetBeginAnnotation
  • CharacterOffsetEndAnnotation
  • BeforeAnnotation
  • AfterAnnotation
  • IndexAnnotation
  • SentenceIndexAnnotation
  • PartOfSpeechAnnotation
  • NamedEntityTagAnnotation

From edu.stanford.nlp.trees.TreeCoreAnnotations:

  • HeadWordAnnotation - in my experiments this always points to itself, i.e. to the token the annotation was retrieved from.
  • HeadTagAnnotation

Thanks in advance!

1 Answer:

Answer 0 (score: 1)

There is no support for Spanish dependency parsing in CoreNLP at the moment. This includes typed dependency conversion from constituency parses.

There is a head finder implemented (but not fully tested). You could hack an untyped dependency converter using this head finder, but we have no guarantees that this will yield a sensible parse.
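
To make the "hack an untyped dependency converter" idea concrete, here is a minimal sketch of the general head-percolation technique, not CoreNLP code. Everything in it is an assumption made for illustration: `Node` is a hypothetical stand-in for CoreNLP's `Tree`, `HeadRule` is a hypothetical stand-in for a head finder (in CoreNLP, the `HeadFinder` interface's `determineHead(Tree)` would play that role), and the "rightmost child" rule and the tag labels in the demo are toys, not Spanish head rules. It avoids Java 8 features, since the question rules them out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, self-contained sketch. Node and HeadRule are NOT CoreNLP API;
// they stand in for Tree and for a head finder respectively.
class UntypedDependencySketch {

    /** Minimal constituency node. Leaves are words and carry a 1-based token index. */
    static final class Node {
        final String label;
        final int index; // token index for leaves, 0 for internal nodes
        final List<Node> children;

        Node(String label, int index, Node... children) {
            this.label = label;
            this.index = index;
            this.children = Arrays.asList(children);
        }

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    /** Chooses which child of an internal node is its head. */
    interface HeadRule {
        Node headChild(Node parent);
    }

    /** Directed, unlabeled edge between token indices; governor 0 is an artificial root. */
    static final class Edge {
        final int governor;
        final int dependent;

        Edge(int governor, int dependent) {
            this.governor = governor;
            this.dependent = dependent;
        }

        @Override
        public String toString() {
            return governor + "->" + dependent;
        }
    }

    /** Returns one edge per token: governor -> dependent, plus 0 -> sentence head. */
    static List<Edge> extract(Node root, HeadRule rule) {
        List<Edge> edges = new ArrayList<>();
        int rootHead = collect(root, rule, edges);
        edges.add(new Edge(0, rootHead));
        return edges;
    }

    /** Percolates lexical heads bottom-up; returns the head token index of this subtree. */
    private static int collect(Node node, HeadRule rule, List<Edge> edges) {
        if (node.isLeaf()) {
            return node.index;
        }
        Node headChild = rule.headChild(node);
        int head = -1;
        List<Integer> otherHeads = new ArrayList<>();
        for (Node child : node.children) {
            int childHead = collect(child, rule, edges);
            if (child == headChild) {
                head = childHead;
            } else {
                otherHeads.add(childHead);
            }
        }
        if (head < 0) {
            throw new IllegalStateException("head rule must return one of the node's children");
        }
        // Every non-head child's lexical head depends on the head child's lexical head.
        for (int dependent : otherHeads) {
            edges.add(new Edge(head, dependent));
        }
        return head;
    }

    /** Toy example: (S (NP (D El) (N perro)) (VP (V ladra))) with a rightmost-child rule. */
    static List<Edge> demo() {
        Node tree = new Node("S", 0,
                new Node("NP", 0,
                        new Node("D", 0, new Node("El", 1)),
                        new Node("N", 0, new Node("perro", 2))),
                new Node("VP", 0,
                        new Node("V", 0, new Node("ladra", 3))));
        HeadRule rightmost = new HeadRule() {
            @Override
            public Node headChild(Node parent) {
                return parent.children.get(parent.children.size() - 1);
            }
        };
        return extract(tree, rightmost);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [2->1, 3->2, 0->3]
    }
}
```

The output is exactly the list of directed edges the question asks for. Its quality depends entirely on the head rule plugged in, which matches the caveat above: with an untested head finder there is no guarantee that the resulting tree is sensible.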