I am using the Stanford CoreNLP framework 3.4.1 to build syntactic parse trees of Wikipedia sentences. Afterwards I want to extract from each parse tree all tree fragments up to a certain length (i.e. at most 5 nodes), but I am having a lot of trouble figuring out how to do that without creating a new GrammaticalStructure for each sub-tree.
This is what I use to construct the parse tree; most of the code comes from TreePrint.printTreeInternal() for the conll2007 format, which I modified for my output needs:
// imports used by this snippet (assuming CoreNLP 3.4.1 on the classpath)
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.CoreMap;

DocumentPreprocessor dp = new DocumentPreprocessor(new StringReader(documentText));
// load the parser model and grammatical structure factory once, outside the sentence loop
LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz",
        "-maxLength", "80");
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();

for (List<HasWord> sentence : dp) {
    StringBuilder plaintexSyntacticTree = new StringBuilder();
    String sentenceString = Sentence.listToString(sentence);
    PTBTokenizer<Word> tkzr = PTBTokenizer.newPTBTokenizer(new StringReader(sentenceString));
    List<Word> toks = tkzr.tokenize();
    // skip sentences shorter than 5 words
    if (toks.size() < 5)
        continue;
    log.info("\nTokens are: " + PTBTokenizer.labelList2Text(toks));

    Tree parse = lp.apply(toks);
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection<TypedDependency> tdl = gs.allTypedDependencies();
    Tree it = parse.deepCopy(parse.treeFactory(), CoreLabel.factory());
    it.indexLeaves();
    List<CoreLabel> tagged = it.taggedLabeledYield();

    // build and sort the dependency list (after TreePrint's getSortedDeps)
    List<Dependency<Label, Label, Object>> sortedDeps = new ArrayList<Dependency<Label, Label, Object>>();
    for (TypedDependency dep : tdl) {
        NamedDependency nd = new NamedDependency(dep.gov().label(), dep.dep().label(), dep.reln().toString());
        sortedDeps.add(nd);
    }
    Collections.sort(sortedDeps, Dependencies.dependencyIndexComparator());

    for (int i = 0; i < sortedDeps.size(); i++) {
        Dependency<Label, Label, Object> d = sortedDeps.get(i);
        CoreMap dep = (CoreMap) d.dependent();
        CoreMap gov = (CoreMap) d.governor();
        Integer depi = dep.get(CoreAnnotations.IndexAnnotation.class);
        Integer govi = gov.get(CoreAnnotations.IndexAnnotation.class);
        CoreLabel w = tagged.get(depi - 1);
        // used for both coarse and fine POS tag fields
        String tag = PTBTokenizer.ptbToken2Text(w.tag());
        String word = PTBTokenizer.ptbToken2Text(w.word());
        if (plaintexSyntacticTree.length() > 0)
            plaintexSyntacticTree.append(' ');
        plaintexSyntacticTree.append(word + '/' + tag + '/' + govi);
    }
    log.info("\nTree is: " + plaintexSyntacticTree);
}
In the output I need to get something in this format: word/Part-Of-Speech-tag/parentID, which is compatible with the output of Google Syntactic N-Grams. What I cannot figure out is how to get the POS tag and the parentID (which, as far as I can tell, are stored in the GrammaticalStructure as a list of dependencies) from the original parse tree for only a subset of the nodes of the original tree.
I have also seen some mentions of HeadFinder, but as far as I understand that is only useful for building a GrammaticalStructure, whereas I am trying to use the existing one. I have also seen a similar question about converting GrammaticalStructure to Tree, but that is still an open question and it does not deal with the subtree problem or with creating custom output. Instead of creating a tree from the GrammaticalStructure, I figured I could use the node references from the tree to get the information I need, but what I am basically missing is an equivalent of getNodeByIndex() that works node by node on the GrammaticalStructure.
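For reference, SemanticGraph (which the update below ends up using, together with edu.stanford.nlp.ling.IndexedWord) does offer the index-based lookup that GrammaticalStructure lacks. A minimal sketch, assuming a SemanticGraph sg has already been built for the sentence:

// Sketch: an equivalent of the missing getNodeByIndex(), on SemanticGraph;
// getNodeByIndexSafe returns null instead of throwing for unknown indices.
IndexedWord node = sg.getNodeByIndexSafe(3); // the token with index 3, if any
if (node != null) {
    log.info(node.word() + "/" + node.tag());
}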
Update: I have managed to get all the required information by using a SemanticGraph, as suggested in the answer. Here is the basic code snippet that does it:
// additional imports for this snippet
import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;

String documentText = value.toString();
Properties props = new Properties();
props.put("annotators", "tokenize,ssplit,pos,depparse");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation(documentText);
pipeline.annotate(annotation);
List<CoreMap> sentences = annotation.get(CoreAnnotations.SentencesAnnotation.class);
if (sentences != null && !sentences.isEmpty()) {
    // only the first sentence is processed here
    CoreMap sentence = sentences.get(0);
    SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
    log.info("SemanticGraph: " + sg.toDotFormat());
    for (SemanticGraphEdge edge : sg.edgeIterable()) {
        int headIndex = edge.getGovernor().index();
        int depIndex = edge.getDependent().index();
        // print the dependent as word/POS/parentID
        log.info("[" + depIndex + "] " + edge.getDependent().word()
                + "/" + edge.getDependent().get(CoreAnnotations.PartOfSpeechAnnotation.class)
                + "/" + headIndex);
    }
}
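One caveat about the edge loop above: the root word of the sentence never shows up as a dependent, since it has no incoming edge. A per-token sketch that also emits it, giving the root parent ID 0 (the usual encoding for a head-less word); sg is the same SemanticGraph as above:

// Sketch: one word/POS/parentID triple per token, root included.
for (IndexedWord node : sg.vertexListSorted()) {
    IndexedWord head = sg.getParent(node); // null when node is the root
    int parentId = (head == null) ? 0 : head.index();
    log.info(node.word() + "/" + node.tag() + "/" + parentId);
}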
Answer 0 (score: 0)
Google Syntactic N-Grams uses dependency trees rather than constituency trees, so indeed the only way to get that representation is to convert the tree into a dependency tree. The parent IDs you would get from a constituency parse would point to intermediate nodes, not to another word of the sentence.
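To make that concrete, here is a small sketch that walks up from each leaf of the constituency tree, reusing the parse variable from the question's first snippet; every ancestor of a word is a category node such as NN or NP, never another word:

// Sketch: a word's parents in a constituency tree are category nodes,
// so their IDs cannot serve as word-valued parent IDs.
for (Tree leaf : parse.getLeaves()) {
    Tree posNode = leaf.parent(parse);        // preterminal, e.g. NN
    Tree phraseNode = posNode.parent(parse);  // phrase node, e.g. NP
    System.out.println(leaf.value() + " <- " + posNode.value() + " <- " + phraseNode.value());
}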
My suggestion is to run the dependency parser annotator (annotators = tokenize,ssplit,pos,depparse) and to extract all clusters of up to 5 neighboring nodes from the resulting SemanticGraph, e.g. as sketched below.
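A minimal sketch of that extraction; connectedClusters and expand are hypothetical helpers, not part of the CoreNLP API. Edges are treated as undirected, so a cluster may span both parents and children:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.semgraph.SemanticGraph;

// Enumerate all connected clusters of at most maxSize nodes.
static Set<Set<IndexedWord>> connectedClusters(SemanticGraph sg, int maxSize) {
    Set<Set<IndexedWord>> result = new HashSet<Set<IndexedWord>>();
    for (IndexedWord seed : sg.vertexSet()) {
        expand(sg, Collections.singleton(seed), maxSize, result);
    }
    return result;
}

// Grow a connected cluster by one neighbor at a time, recording every
// intermediate cluster and pruning branches that were already explored.
static void expand(SemanticGraph sg, Set<IndexedWord> cluster, int maxSize,
                   Set<Set<IndexedWord>> result) {
    if (!result.add(cluster) || cluster.size() == maxSize) {
        return; // cluster already seen, or it cannot grow any further
    }
    for (IndexedWord member : cluster) {
        List<IndexedWord> neighbors = new ArrayList<IndexedWord>(sg.getChildList(member));
        neighbors.addAll(sg.getParentList(member));
        for (IndexedWord next : neighbors) {
            if (!cluster.contains(next)) {
                Set<IndexedWord> grown = new HashSet<IndexedWord>(cluster);
                grown.add(next);
                expand(sg, grown, maxSize, result);
            }
        }
    }
}

Since the dependency graph of a sentence is essentially tree-shaped (each word has a single head), the number of connected clusters of size at most 5 stays small, so this brute-force expansion is practical at sentence scale.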