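I want to parse a sentence into a binary parse of the following form (the format used in the SNLI corpus):
sentence: "A person on a horse jumps over a broken down airplane."
parse: ( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )
I have not been able to find a parser that does this.
Note: this question has been asked before (How to get a binary parse in Python), but the answers there did not help, and I could not comment on them because I do not have the required reputation.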
Answer 0 (score: 0)
Here is some sample code that removes the label from every node in the tree.
package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;
import java.util.*;

public class PrintTreeWithoutLabelsExample {

    public static void main(String[] args) {
        // set up pipeline properties
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse");
        // use faster shift reduce parser
        props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
        props.setProperty("parse.maxlen", "100");
        props.setProperty("parse.binaryTrees", "true");
        // set up Stanford CoreNLP pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // build annotation for text
        Annotation annotation = new Annotation("The red car drove on the highway.");
        // annotate the text
        pipeline.annotate(annotation);
        for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
            Tree sentenceConstituencyParse = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            // blank out the label of every non-leaf node so that only the words remain
            for (Tree subTree : sentenceConstituencyParse.subTrees()) {
                if (!subTree.isLeaf())
                    subTree.setLabel(CoreLabel.wordFromString(""));
            }
            // print the label-free tree on a single line
            TreePrint treePrint = new TreePrint("oneline");
            treePrint.printTree(sentenceConstituencyParse);
        }
    }
}
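Answer 1 (score: 0)
I analyzed the accepted version and, since I needed something in Python, wrote a simple function that produces the same result. To parse the sentences I adapted the version found at the referenced link.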
import re
import string
from stanfordcorenlp import StanfordCoreNLP
from nltk import Tree
from functools import reduce

# matches any single ASCII punctuation character
regex = re.compile('[%s]' % re.escape(string.punctuation))

def parse_sentence(sentence):
    nlp = StanfordCoreNLP(r'./stanford-corenlp-full-2018-02-27')
    # strip punctuation before parsing
    sentence = regex.sub('', sentence)
    result = nlp.parse(sentence)
    # flatten the multi-line parse onto one line and collapse repeated spaces
    result = result.replace('\n', '')
    result = re.sub(' +', ' ', result)
    nlp.close()  # Do not forget to close! The backend server will consume a lot of memory.
    # return str (not bytes) so that binarize() can call str.replace on it under Python 3
    return result

def binarize(parsed_sentence):
    sentence = parsed_sentence.replace("\n", "")
    # drop the constituent / POS label that follows each opening parenthesis;
    # the \b keeps "S" from also eating the start of "SBAR"
    for pattern in ["ROOT", "SINV", "NP", "S", "PP", "ADJP", "SBAR",
                    "DT", "JJ", "NNS", "VP", "VBP", "RB"]:
        sentence = re.sub(r"\({}\b".format(pattern), "(", sentence)
    sentence = re.sub(' +', ' ', sentence)
    return sentence
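The hard-coded label list in `binarize` only covers the tags that happen to occur in a few sentences, so any other tag (for example `NN` or `VBZ`) would be left in place. A minimal sketch of a more general variant, assuming the input is a Penn-Treebank-style bracketing in which every opening parenthesis is immediately followed by a label; the name `strip_labels` is made up for this illustration:

```python
import re

def strip_labels(parsed_sentence):
    """Remove whatever label follows each opening parenthesis."""
    # put the parse on one line and collapse runs of whitespace
    flat = re.sub(r"\s+", " ", parsed_sentence).strip()
    # "(" followed by any run of non-space, non-parenthesis characters -> "("
    return re.sub(r"\([^\s()]+", "(", flat)

tree = ("(ROOT (S (NP (DT The) (JJ new) (NNS rights)) "
        "(VP (VBP are) (ADJP (JJ nice) (RB enough)))))")
print(strip_labels(tree))
# ( ( ( ( The) ( new) ( rights)) ( ( are) ( ( nice) ( enough)))))
```

Because it does not depend on a fixed tag list, this should behave the same for tags that the loop above does not mention.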
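Neither my version nor the accepted one produces the same result as the SNLI or MultiNLI corpora: the corpora group two single leaves of the tree into one constituent, whereas both answers wrap every word in its own pair of parentheses. An example from the MultiNLI corpus reads
"( ( The ( new rights ) ) ( are ( nice enough ) ) )"
,
whereas both answers here return
'( ( ( ( The) ( new) ( rights)) ( ( are) ( ( nice) ( enough)))))'
.
I am not an NLP expert, so I hope this makes no difference. At least it does not for my application.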
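If the exact corpus layout does matter, the label-free output can be post-processed. The following is only a sketch under two assumptions that the corpora do not confirm: single-child nodes are collapsed into their child, and nodes with more than two children are binarized right-branching. The helper names `read_tree`, `to_binary` and `render` are made up for this illustration, and the input is assumed to be a well-formed, non-empty bracketing.

```python
import re

def read_tree(text):
    """Read a label-free bracketing into nested lists of word strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])           # open a new node
        elif tok == ")":
            node = stack.pop()         # close the node and attach it to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(tok)      # a word
    return stack[0][0]

def to_binary(node):
    """Collapse single-child nodes; fold wider nodes from the right (assumption)."""
    if isinstance(node, str):
        return node
    children = [to_binary(child) for child in node]
    while len(children) > 2:
        children[-2:] = [[children[-2], children[-1]]]
    return children[0] if len(children) == 1 else children

def render(node):
    """Write the tree back out in the spaced-parenthesis layout."""
    if isinstance(node, str):
        return node
    return "( " + " ".join(render(child) for child in node) + " )"

flat = "( ( ( ( The) ( new) ( rights)) ( ( are) ( ( nice) ( enough)))))"
print(render(to_binary(read_tree(flat))))
# ( ( The ( new rights ) ) ( are ( nice enough ) ) )
```

For this sentence the result coincides with the MultiNLI example, but right-branching is a guess at the binarization the corpora used, so other sentences may well come out differently.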