Question

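I am using the Stanford NLP parsing toolkit. Given a word from the lexicon, how can I find its frequency*? Or, given a frequency rank, how can I determine the corresponding word?

*Across the language as a whole, not just in a text sample.

Here is the demo for the toolkit I am using: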

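import java.util.*;

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.*;

class ParserDemo {
  public static void main(String[] args) {
    // Load the serialized English PCFG grammar and set the parser options.
    LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
    lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

    // Parse a pre-tokenised sentence and print the tree in Penn Treebank format.
    String[] sent = { "Sincerity", "may", "frighten", "the", "boy", "." };
    Tree parse = (Tree) lp.apply(Arrays.asList(sent));
    parse.pennPrint();
    System.out.println();

    // Derive the collapsed typed dependencies from the parse tree.
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    Collection tdl = gs.typedDependenciesCollapsed();
    System.out.println(tdl);
    System.out.println();

    // Print the tree and the dependencies in a single call.
    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }
}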

Answer 1

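If you only want to count word frequencies, you do not need to parse sentences. All you need to do is tokenise the input and then count the word frequencies with a Java HashMap. If you want to use the Stanford tools, use any of the tokenizers in edu.stanford.nlp.process.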
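
To make the counting step concrete, here is a minimal sketch in plain Java. The class name WordFrequency and its count method are made up for illustration, and the regex split is only a crude stand-in for a real tokenizer such as the ones in edu.stanford.nlp.process.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class, not part of the Stanford toolkit.
class WordFrequency {
    // Counts how often each lowercased token occurs in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        // Crude tokenisation (an assumption): split on runs of non-letter characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                // Insert 1 for a new word, otherwise add 1 to the existing count.
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = count("The boy saw the dog. The dog ran.");
        System.out.println(freq.get("the"));             // prints 3
        System.out.println(freq.getOrDefault("cat", 0)); // prints 0
    }
}
```

Note that the counts only describe whatever text you feed in, so a frequency for the language as a whole would need a large, representative corpus as input.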

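This gives you the frequency of any given word, but in general it may not be possible to find the one word that corresponds to a given frequency rank, because several words may be equally frequent in the document.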
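
One way to live with ties is to return every word that shares a rank instead of a single word. The sketch below assumes a word-to-count map has already been built and uses dense ranking (tied words share a rank and the next distinct count gets the next rank); FrequencyRank and wordsAtRank are hypothetical names, and other tie-handling conventions are equally valid.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the Stanford toolkit.
class FrequencyRank {
    // Returns all words at the given 1-based rank; rank 1 is the highest count.
    static List<String> wordsAtRank(Map<String, Integer> freq, int rank) {
        // Group the words by count, with the highest count first.
        TreeMap<Integer, List<String>> byCount = new TreeMap<>(Collections.reverseOrder());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        int current = 1;
        for (List<String> words : byCount.values()) {
            if (current == rank) {
                Collections.sort(words); // alphabetical order for stable output
                return words;
            }
            current++;
        }
        return Collections.emptyList(); // no such rank
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("the", 3);
        freq.put("dog", 2);
        freq.put("boy", 1);
        freq.put("ran", 1);
        System.out.println(wordsAtRank(freq, 1)); // prints [the]
        System.out.println(wordsAtRank(freq, 3)); // prints [boy, ran]
    }
}
```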

Answer 2

This is more of an IR (information retrieval) problem than an NLP one. For this task, you should look at libraries such as Lucene.

Java Stanford NLP: Find word frequency?
