Question

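I have been trying to use the Stanford Parser in my Java program to parse some Chinese sentences. Since I am new to both Java and the Stanford Parser, I used 'ParserDemo.java' to practice. The code works for English sentences and outputs the correct result. However, when I changed the model to 'chinesePCFG.ser.gz' and tried to parse some segmented Chinese sentences, things went wrong.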

Here is my Java code:

class ParserDemo {

  public static void main(String[] args) {
    LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
    if (args.length > 0) {
      demoDP(lp, args[0]);
    } else {
      demoAPI(lp);
    }
  }

  public static void demoDP(LexicalizedParser lp, String filename) {
    // This option shows loading, sentence-segmenting and tokenizing
    // a file using DocumentPreprocessor
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
      Tree parse = lp.apply(sentence);
      parse.pennPrint();
      System.out.println();

      GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
      Collection tdl = gs.typedDependenciesCCprocessed(true);
      System.out.println(tdl);
      System.out.println();
    }
  }

  public static void demoAPI(LexicalizedParser lp) {
    // This option shows parsing a list of correctly tokenized words
    String sent[] = { "我", "是", "一名", "学生" };
    List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();

    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();

    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

  private ParserDemo() {} // static methods only
}

It is basically the same as ParserDemo.java, but when I run it I get the following result:

Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ... done [2.2 sec].
(ROOT
  (IP
    (NP (PN 我))
    (VP (VC 是)
      (NP
        (QP (CD 一名))
        (NP (NN 学生))))))

Exception in thread "main" java.lang.RuntimeException: Failed to invoke public edu.stanford.nlp.trees.EnglishGrammaticalStructure(edu.stanford.nlp.trees.Tree)
    at edu.stanford.nlp.trees.GrammaticalStructureFactory.newGrammaticalStructure(GrammaticalStructureFactory.java:104)
    at parserdemo.ParserDemo.demoAPI(ParserDemo.java:65)
    at parserdemo.ParserDemo.main(ParserDemo.java:23)

The code on line 65 is:

 GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);

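My guess is that chinesePCFG.ser.gz is missing something related to 'edu.stanford.nlp.trees.EnglishGrammaticalStructure'. Since the parser parses Chinese correctly from the command line, there must be something wrong with my own code. I have been searching and only found a few similar cases, some of which mention using the right model, but I really do not know how to change the code to use "the right model". I hope someone can help. I am new to Java and the Stanford Parser, so please be specific. Thanks!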

Answer 1

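The problem is that the GrammaticalStructureFactory is built from a PennTreebankLanguagePack, which is for the English Penn Treebank. The parse itself succeeds; the exception is thrown only when the English factory tries to build an EnglishGrammaticalStructure from a Chinese tree. You need to use (in two places, once in demoDP and once in demoAPI)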

TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();

and add the corresponding import

import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

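We also generally recommend using the factored parser for Chinese (unlike for English, it works considerably better, at the cost of more memory and time)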

LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");
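
The rule behind this fix is that the language pack has to match the language of the model. As a minimal sketch of that rule, the helper below picks a language pack class name from the model file name. The class LanguagePackChooser, its lookup table, and the method languagePackFor are hypothetical names made up for illustration and are not part of the Stanford Parser. Only the two language pack class names are real. The sketch uses the standard library only, so it runs without the parser jar.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the Stanford Parser API.
final class LanguagePackChooser {

  // Assumed convention: model file names start with the language name,
  // as in chinesePCFG.ser.gz or englishPCFG.ser.gz.
  private static final Map<String, String> PACKS = new LinkedHashMap<>();
  static {
    PACKS.put("chinese",
        "edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack");
    PACKS.put("english",
        "edu.stanford.nlp.trees.PennTreebankLanguagePack");
  }

  // Returns the language pack class name whose key prefixes the model file name.
  static String languagePackFor(String modelPath) {
    String fileName = modelPath
        .substring(modelPath.lastIndexOf('/') + 1)
        .toLowerCase(Locale.ROOT);
    for (Map.Entry<String, String> entry : PACKS.entrySet()) {
      if (fileName.startsWith(entry.getKey())) {
        return entry.getValue();
      }
    }
    throw new IllegalArgumentException("No language pack known for model: " + modelPath);
  }

  public static void main(String[] args) {
    System.out.println(
        languagePackFor("edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz"));
    System.out.println(
        languagePackFor("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"));
  }

  private LanguagePackChooser() {} // static methods only
}
```

In a real program you would most likely just write new ChineseTreebankLanguagePack() as shown above. A lookup like this is only worth considering if the same code has to load models for several languages.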

How to parse languages other than English with the Stanford Parser? In Java, not from the command line
