How to use the Stanford Parser to parse languages other than English? In Java, not on the command line

Asked: 2012-07-11 09:36:24

Tags: java nlp stanford-nlp

I have been trying to use the Stanford Parser in my Java program to parse some Chinese sentences. Since I am new to both Java and the Stanford Parser, I used 'ParseDemo.java' to practice. The code works for English sentences and prints the correct result. However, when I changed the model to 'chinesePCFG.ser.gz' and tried to parse some pre-segmented Chinese sentences, things went wrong.

Here is my code in Java:

import java.util.Collection;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.*;

class ParserDemo {

  public static void main(String[] args) {
    LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz");
    if (args.length > 0) {
      demoDP(lp, args[0]);
    } else {
      demoAPI(lp);
    }
  }

  public static void demoDP(LexicalizedParser lp, String filename) {
    // This option shows loading and sentence-segment and tokenizing
    // a file using DocumentPreprocessor
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
      Tree parse = lp.apply(sentence);
      parse.pennPrint();
      System.out.println();

      GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
      Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed(true);
      System.out.println(tdl);
      System.out.println();
    }
  }

  public static void demoAPI(LexicalizedParser lp) {
    // This option shows parsing a list of correctly tokenized words
    String sent[] = { "我", "是", "一名", "学生" };
    List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();

    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();

    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

  private ParserDemo() {} // static methods only
}

It is essentially the same as ParserDemo.java, but when I run it I get the following output:

Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ... done [2.2 sec].
(ROOT (IP
  (NP (PN 我))
  (VP (VC 是)
    (NP
      (QP (CD 一名))
      (NP (NN 学生))))))

Exception in thread "main" java.lang.RuntimeException: Failed to invoke public edu.stanford.nlp.trees.EnglishGrammaticalStructure(edu.stanford.nlp.trees.Tree)
    at edu.stanford.nlp.trees.GrammaticalStructureFactory.newGrammaticalStructure(GrammaticalStructureFactory.java:104)
    at parserdemo.ParserDemo.demoAPI(ParserDemo.java:65)
    at parserdemo.ParserDemo.main(ParserDemo.java:23)

The code at line 65 is:

 GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);

My guess is that chinesePCFG.ser.gz is missing something related to 'edu.stanford.nlp.trees.EnglishGrammaticalStructure'. Since the parser parses Chinese correctly from the command line, there must be something wrong with my own code. I have searched around and only found a few similar cases, some of which mentioned using the correct model, but I really don't know how to modify my code to use the "correct model". I hope someone can help. I am new to both Java and the Stanford Parser, so please be specific. Thanks!

1 answer:

Answer 0 (score: 2)

The problem is that the GrammaticalStructureFactory is built from a PennTreebankLanguagePack, which is for the English Penn Treebank. You need to use (in both places)

TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();

and add the corresponding import:

import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

However, for Chinese we generally also recommend using the factored parser, since (unlike for English) it works much better, at the cost of more memory and time:

LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");
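Putting the answer's two fixes together, a corrected version of demoAPI might look like the sketch below. This is only an illustration assembled from the snippets above, not the answerer's exact code; the class name `ChineseParserDemo` is made up, and it assumes the Stanford Parser jar and model files are on the classpath:

```java
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack;

class ChineseParserDemo {
  public static void main(String[] args) {
    // Load the factored Chinese model (better accuracy than the PCFG,
    // at the cost of more memory and time).
    LexicalizedParser lp = LexicalizedParser.loadModel(
        "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");

    // Parse a list of correctly pre-segmented Chinese words.
    String[] sent = { "我", "是", "一名", "学生" };
    List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();

    // Use the Chinese language pack, so the factory produces a Chinese
    // grammatical structure instead of an EnglishGrammaticalStructure
    // (which is what threw the RuntimeException above).
    TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    System.out.println(gs.typedDependenciesCCprocessed());
  }
}
```

The same one-line change to the TreebankLanguagePack also applies in demoDP, the other place the factory is created.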