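I have been trying to use the Stanford Parser in my Java program to parse some Chinese sentences. Since I am new to both Java and the Stanford Parser, I used 'ParserDemo.java' to practice. The code works for English sentences and outputs the correct result. However, when I changed the model to 'chinesePCFG.ser.gz' and tried to parse some segmented Chinese sentences, things went wrong.
Here is my Java code: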
class ParserDemo {

    public static void main(String[] args) {
        LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
        if (args.length > 0) {
            demoDP(lp, args[0]);
        } else {
            demoAPI(lp);
        }
    }

    public static void demoDP(LexicalizedParser lp, String filename) {
        // This option shows loading, sentence-segmenting and tokenizing
        // a file using DocumentPreprocessor
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        // You could also create a tokenizer here (as below) and pass it
        // to DocumentPreprocessor
        for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
            Tree parse = lp.apply(sentence);
            parse.pennPrint();
            System.out.println();
            GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
            Collection tdl = gs.typedDependenciesCCprocessed(true);
            System.out.println(tdl);
            System.out.println();
        }
    }

    public static void demoAPI(LexicalizedParser lp) {
        // This option shows parsing a list of correctly tokenized words
        String sent[] = { "我", "是", "一名", "学生" };
        List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
        Tree parse = lp.apply(rawWords);
        parse.pennPrint();
        System.out.println();
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
        System.out.println(tdl);
        System.out.println();
        TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
    }

    private ParserDemo() {} // static methods only
}
It is basically the same as ParserDemo.java, but when I run it I get the following result:
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ... done [2.2 sec].
(ROOT (IP (NP (PN 我)) (VP (VC 是) (NP (QP (CD 一名)) (NP (NN 学生))))))
Exception in thread "main" java.lang.RuntimeException: Failed to invoke public edu.stanford.nlp.trees.EnglishGrammaticalStructure(edu.stanford.nlp.trees.Tree)
    at edu.stanford.nlp.trees.GrammaticalStructureFactory.newGrammaticalStructure(GrammaticalStructureFactory.java:104)
    at parserdemo.ParserDemo.demoAPI(ParserDemo.java:65)
    at parserdemo.ParserDemo.main(ParserDemo.java:23)
Line 65 of the code is:
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
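My guess is that chinesePCFG.ser.gz is missing something related to 'edu.stanford.nlp.trees.EnglishGrammaticalStructure'. Since the parser parses Chinese correctly from the command line, there must be something wrong with my own code. I have been searching and found only a few similar cases. Some of them mention using the correct model, but I really do not know how to change the code to use the "correct model". I hope someone can help. I am new to Java and the Stanford Parser, so please be specific. Thanks!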
Answer 0 (score: 2)
The problem is that the GrammaticalStructureFactory is constructed from a PennTreebankLanguagePack, which is for the English Penn Treebank. You need to use (in two places)
TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
and add the corresponding import:
import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;
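We also generally recommend the factored parser for Chinese, since (unlike for English) it works considerably better, at the cost of more memory and time: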
LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");
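One way to guard against this kind of mismatch is to pick the language pack from the model file name, so that changing the model string also changes the language pack. The sketch below is only an illustration: the class LanguagePackChooser and the helper languagePackFor are made-up names, not part of the Stanford Parser. Only the two fully qualified class names and the model file names come from the parser distribution. It returns class names as strings so that it compiles without the parser jar.

```java
import java.util.Locale;

// Hypothetical helper class, not part of the Stanford Parser.
final class LanguagePackChooser {

    // Fully qualified names of the language pack classes in the parser distribution.
    static final String ENGLISH_TLP =
            "edu.stanford.nlp.trees.PennTreebankLanguagePack";
    static final String CHINESE_TLP =
            "edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack";

    private LanguagePackChooser() {} // static methods only

    // Assumption: model files are named after their language
    // (englishPCFG.ser.gz, chinesePCFG.ser.gz, chineseFactored.ser.gz).
    static String languagePackFor(String modelPath) {
        // Keep only the file name, then compare case-insensitively.
        String fileName = modelPath
                .substring(modelPath.lastIndexOf('/') + 1)
                .toLowerCase(Locale.ROOT);
        if (fileName.startsWith("chinese")) {
            return CHINESE_TLP;
        }
        if (fileName.startsWith("english")) {
            return ENGLISH_TLP;
        }
        // Fail loudly instead of silently falling back to English.
        throw new IllegalArgumentException("No language pack known for model: " + modelPath);
    }

    public static void main(String[] args) {
        String model = "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz";
        System.out.println(languagePackFor(model));
    }
}
```

With the parser jar on the classpath, the Chinese branch corresponds to writing new ChineseTreebankLanguagePack() directly, as in the lines above. The helper only makes the link between model and language pack explicit.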