I have a list of tagged sentences stored in a txt file in the following format:
We_PRP 've_VBP just_RB wrapped_VBN up_RP with_IN the_DT boys_NNS of_IN Block_NNP B_NNP
Now I want to parse these sentences, and I found the following code:
String filename = "tt.txt";
// The parser itself was not shown in the snippet; it needs to be loaded first.
LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
// This option shows loading and sentence-segmenting and tokenizing
// a file using DocumentPreprocessor.
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
// You could also create a tokenizer here (as below) and pass it
// to DocumentPreprocessor
for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
    Tree parse = lp.apply(sentence);
    parse.pennPrint();
    System.out.println();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();
}
The parse output is very long, and I think the problem is the line new DocumentPreprocessor(filename): it re-tokenizes and re-tags my sentences instead of using the tags that are already in the file. Is there any way to skip the tagging step?
Answer 0 (score: 0)
You can find the answer in the Parser FAQ. I tried it and it works for me:
// set up grammar and options as appropriate
// (here the standard English PCFG model shipped with the parser)
LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
String[] sent3 = { "It", "can", "can", "it", "." };
// Parser gets tag of second "can" wrong without help
String[] tag3 = { "PRP", "MD", "VB", "PRP", "." };
List<TaggedWord> sentence3 = new ArrayList<TaggedWord>();
for (int i = 0; i < sent3.length; i++) {
    sentence3.add(new TaggedWord(sent3[i], tag3[i]));
}
Tree parse = lp.parse(sentence3);
parse.pennPrint();
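Applied to the file format in the question, each line can be split on whitespace and every token split at its last underscore to recover the word/tag pairs, which are then handed to the parser as TaggedWord objects so no tagger runs. The following is only a minimal sketch under the assumption that every token in tt.txt has the form word_TAG; the class name ParsePreTagged and the helper readTaggedSentence are mine, not part of the Stanford API.

import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ParsePreTagged {
    // Split "We_PRP" into word "We" and tag "PRP" at the last underscore.
    // Assumes every token in the file really has the form word_TAG.
    static List<TaggedWord> readTaggedSentence(String line) {
        List<TaggedWord> sentence = new ArrayList<TaggedWord>();
        for (String token : line.trim().split("\\s+")) {
            int idx = token.lastIndexOf('_');
            sentence.add(new TaggedWord(token.substring(0, idx), token.substring(idx + 1)));
        }
        return sentence;
    }

    public static void main(String[] args) throws IOException {
        LexicalizedParser lp = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
        for (String line : Files.readAllLines(Paths.get("tt.txt"))) {
            if (line.trim().isEmpty()) continue;  // skip blank lines
            Tree parse = lp.parse(readTaggedSentence(line));
            parse.pennPrint();
            System.out.println();
        }
    }
}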