Mapping tree nodes to GrammaticalStructure dependencies

Asked: 2015-10-09 13:52:27

Tags: java tree nlp stanford-nlp

I am using the Stanford Core NLP framework 3.4.1 to build syntactic parse trees of Wikipedia sentences. Afterwards I want to extract from each parse tree all tree fragments of a certain length (i.e. at most 5 nodes), but I am having a lot of trouble figuring out how to do that without creating a new GrammaticalStructure for each sub-tree.

This is what I use to construct the parse tree. Most of the code comes from TreePrint.printTreeInternal() for the conll2007 format, which I modified to fit my output needs:

    DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(documentText));

    for (List<HasWord> sentence : dp) {
        StringBuilder plaintexSyntacticTree = new StringBuilder();
        String sentenceString = Sentence.listToString(sentence);

        PTBTokenizer<Word> tkzr = PTBTokenizer.newPTBTokenizer(new StringReader(sentenceString));
        List<Word> toks = tkzr.tokenize();
        // skip sentences smaller than 5 words
        if (toks.size() < 5)
            continue;
        log.info("\nTokens are: "+PTBTokenizer.labelList2Text(toks));
        // NOTE: loading the model once, outside the sentence loop, would be much faster
        LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
        "-maxLength", "80");
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        Tree parse = lp.apply(toks);
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection<TypedDependency> tdl = gs.allTypedDependencies();
        Tree it = parse.deepCopy(parse.treeFactory(), CoreLabel.factory());
        it.indexLeaves();

        List<CoreLabel> tagged = it.taggedLabeledYield();
        // getSortedDeps
        List<Dependency<Label, Label, Object>> sortedDeps = new ArrayList<Dependency<Label, Label, Object>>();
        for (TypedDependency dep : tdl) {
            NamedDependency nd = new NamedDependency(dep.gov().label(), dep.dep().label(), dep.reln().toString());
            sortedDeps.add(nd);
        }
        Collections.sort(sortedDeps, Dependencies.dependencyIndexComparator());

        for (int i = 0; i < sortedDeps.size(); i++) {
          Dependency<Label, Label, Object> d = sortedDeps.get(i);

          CoreMap dep = (CoreMap) d.dependent();
          CoreMap gov = (CoreMap) d.governor();

          Integer depi = dep.get(CoreAnnotations.IndexAnnotation.class);
          Integer govi = gov.get(CoreAnnotations.IndexAnnotation.class);

          CoreLabel w = tagged.get(depi-1);

          // Used for both coarse and fine POS tag fields
          String tag = PTBTokenizer.ptbToken2Text(w.tag());

          String word = PTBTokenizer.ptbToken2Text(w.word());

          if (plaintexSyntacticTree.length() > 0)
              plaintexSyntacticTree.append(' ');
          plaintexSyntacticTree.append(word+'/'+tag+'/'+govi);
        }
        log.info("\nTree is: "+plaintexSyntacticTree);
    }

In the output I need to get something in this format: word/Part-Of-Speech-tag/parentID, which is compatible with the output of Google Syntactic N-Grams.
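For illustration, here is a minimal, self-contained sketch of that target format. The `Token` record and the sample parse of "John eats apples" below are hypothetical stand-ins, not CoreNLP classes; parentID 0 marks the root:

```java
import java.util.List;
import java.util.StringJoiner;

public class NGramFormat {
    // A token with its POS tag and the 1-based index of its head word (0 = root),
    // mirroring the word/POS-tag/parentID fields described above.
    record Token(String word, String tag, int parentId) {}

    static String format(List<Token> tokens) {
        StringJoiner sj = new StringJoiner(" ");
        for (Token t : tokens) sj.add(t.word() + "/" + t.tag() + "/" + t.parentId());
        return sj.toString();
    }

    public static void main(String[] args) {
        // Hypothetical dependency analysis of "John eats apples":
        // "eats" is the root; "John" and "apples" attach to it (index 2).
        List<Token> sent = List.of(
            new Token("John", "NNP", 2),
            new Token("eats", "VBZ", 0),
            new Token("apples", "NNS", 2));
        System.out.println(format(sent)); // John/NNP/2 eats/VBZ/0 apples/NNS/2
    }
}
```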

What I am unable to figure out is how to get the POS tag and the parentID (which, as far as I can tell, are stored in the GrammaticalStructure as a dependency list) from the original parse tree for only a subset of the nodes of the original tree.

I have also seen some mentions of HeadFinder, but as far as I understand that is only useful for building a GrammaticalStructure, whereas I am trying to use the existing one. I have also seen a similar question about converting a GrammaticalStructure to a Tree, but that is still an open question and it does not deal with the sub-tree problem or with creating a custom output. Instead of creating a tree from the GrammaticalStructure, I figured I could use the node references from the tree to get the information I need, but I am basically missing an equivalent of getNodeByIndex() that can get a node from the GrammaticalStructure by index.

Update: I have managed to get all the required information by using a SemanticGraph, as suggested in the answer. Here is a basic code snippet that does it:

    String documentText = value.toString();
    Properties props = new Properties();
    props.put("annotators", "tokenize,ssplit,pos,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation annotation = new Annotation(documentText);
    pipeline.annotate(annotation);
    List<CoreMap> sentences =  annotation.get(CoreAnnotations.SentencesAnnotation.class);

    if (sentences != null && sentences.size() > 0) {
        CoreMap sentence = sentences.get(0);
        SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
        log.info("SemanticGraph: "+sg.toDotFormat());
       for (SemanticGraphEdge edge : sg.edgeIterable()) {
           int headIndex = edge.getGovernor().index();
           // print word/POS-tag/parentID for the dependent word of each edge
           log.info(edge.getDependent().word() + "/"
                   + edge.getDependent().get(CoreAnnotations.PartOfSpeechAnnotation.class) + "/"
                   + headIndex);
       }
    }

1 Answer:

Answer 0 (score: 0):

Google syntactic n-grams use dependency trees rather than constituency trees. So really the only way to get that representation is to convert your trees into dependency trees. The parent IDs you would get from a constituency parse would refer to intermediate nodes, not to other words in the sentence.

My recommendation is to run the dependency parser annotator (annotators = tokenize,ssplit,pos,depparse) and to extract all clusters of 5 neighboring nodes from the resulting SemanticGraph.
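The cluster extraction this suggests can be sketched independently of the CoreNLP classes. Assuming the SemanticGraph has first been reduced to a plain undirected adjacency map (token index → neighboring token indices, one entry per edge endpoint), a hypothetical `ClusterExtractor` that enumerates every connected node subset of up to 5 nodes might look like:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ClusterExtractor {
    // Enumerate all connected node subsets of size <= maxSize in an
    // undirected graph given as adjacency lists (node -> neighbors).
    public static Set<Set<Integer>> extractClusters(Map<Integer, List<Integer>> adj, int maxSize) {
        Set<Set<Integer>> result = new HashSet<>();
        for (Integer seed : adj.keySet()) {
            grow(adj, new TreeSet<>(List.of(seed)), maxSize, result);
        }
        return result;
    }

    private static void grow(Map<Integer, List<Integer>> adj, TreeSet<Integer> current,
                             int maxSize, Set<Set<Integer>> result) {
        if (!result.add(new TreeSet<>(current))) return; // subset already fully explored
        if (current.size() == maxSize) return;
        // try extending the current subset by each neighbor of each member
        for (Integer node : new ArrayList<>(current)) {
            for (Integer nb : adj.getOrDefault(node, List.of())) {
                if (!current.contains(nb)) {
                    current.add(nb);
                    grow(adj, current, maxSize, result);
                    current.remove(nb);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Path graph 1-2-3: connected subsets of size <= 5 are
        // {1},{2},{3},{1,2},{2,3},{1,2,3} -> 6 clusters
        Map<Integer, List<Integer>> adj = Map.of(
            1, List.of(2),
            2, List.of(1, 3),
            3, List.of(2));
        System.out.println(extractClusters(adj, 5).size()); // 6
    }
}
```

On a real SemanticGraph, the adjacency map would be built from `sg.edgeIterable()` using the governor and dependent indices of each edge; every resulting index set then identifies one candidate tree fragment of up to 5 nodes.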