Question

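I have been using the Stanford Parser for CFG parsing. I can display the output as a tree, but what I really want is a count of the tags.

So I can get output like this, for example (taken from another question on Stack Overflow):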

(ROOT (S (NP (PRP$ My) (NN dog)) (ADVP (RB also)) (VP (VBZ likes) (NP (JJ eating) (NN sausage))) (. .)))

But what I really want is the number of occurrences of each tag, written to a CSV file:

PRP - 1
JJ - 1

Is this possible with the Stanford Parser, especially since I want to process multiple text files, or should I use a different program?

Answer 1

Yes, this is easy to do.

You will need:

import java.util.HashMap;
import edu.stanford.nlp.trees.Tree;

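I assume from the other question that you already have a Tree object. I suspect you only want to count the POS tags (in your example PRP$, NN, RB, ...), but you can do the same for every node in general. Note that in a Stanford Tree the leaves are the words themselves; the POS tags sit on the preterminal nodes directly above them.

Then iterate over all nodes and count only the preterminals:

Tree tree = ...
// Node numbers are 1-based, so the last valid index is tree.size().
for (int i = 1; i <= tree.size(); i++) {
  Tree node = tree.getNodeNumber(i);

  if (node.isPreTerminal()) {
    // count node.value() here (the POS tag)
  }
}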

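The counting itself is done with a HashMap; you will find plenty of examples of that here on Stack Overflow. Basically, start with an empty HashMap that uses the tag as the key and the tag's count as the value.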
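
The counting step can be tried out without the parser on the classpath. The sketch below is a library-free stand-in, not the Tree API: it scans the bracketed string from the question with a regular expression and counts tags in a HashMap. The class name `BracketTagCounter`, the method `countTags`, and the assumption that every preterminal is printed as `(TAG word)` are mine, not the answer's.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, library-free stand-in for the Tree traversal: works on the bracketed string.
class BracketTagCounter {
    // Assumption: a preterminal is printed as "(TAG word)" with no nested parentheses.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(([^\\s()]+)\\s+[^\\s()]+\\)");

    static Map<String, Integer> countTags(String bracketedParse) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher m = PRETERMINAL.matcher(bracketedParse);
        while (m.find()) {
            // Insert 1 for a new tag, otherwise add 1 to the existing count.
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String parse = "(ROOT (S (NP (PRP$ My) (NN dog)) (ADVP (RB also)) "
                + "(VP (VBZ likes) (NP (JJ eating) (NN sausage))) (. .)))";
        System.out.println(countTags(parse));
    }
}
```

With a real Tree object, the same `counts.merge(...)` line would go inside the loop shown above, with the node's tag as the key. Phrase labels such as NP or VP are not counted here, because their first child is another bracket rather than a word.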

Edit: Sorry, corrected a negation error in the code.

Answer 2

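The previous answer, while correct, walks over every node in the parse tree. There is no ready-made method that returns POS tag counts, but you can use the methods in the edu.stanford.nlp.trees.Trees class to get the leaf nodes directly, like this:

(I am using Guava's Function for a little extra elegance in the code, but a simple for loop would work just as well.)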

Tree tree = sentence.get(TreeAnnotation.class); // parse tree of the sentence
List<CoreLabel> labels = Trees.taggedLeafLabels(tree); // returns the labels of the leaves in a Tree, augmented with POS tags.
List<String> tags = Lists.transform(labels, getPOSTag);
Map<String, Integer> counts = new HashMap<String, Integer>();
for (String tag : tags)
    counts.put(tag, Collections.frequency(tags, tag)); // tag -> number of occurrences

where

Function<CoreLabel, String> getPOSTag = new Function<CoreLabel, String>() {
    public String apply(CoreLabel core_label) { return core_label.get(PartOfSpeechAnnotation.class); }
};
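
The question also asks for the counts in a CSV file, which the snippet above does not cover. What follows is a minimal sketch of that last step, assuming the counts are already in a Map. The class name `TagCsv`, the methods `escape` and `toCsv`, and the `tag,count` header are illustrative choices, not part of the Stanford Parser or Guava.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: renders tag counts as CSV text.
class TagCsv {
    // Quote a field that contains a comma or a double quote.
    // This matters here because the Penn Treebank tag for a comma is ",".
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    static String toCsv(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder("tag,count\n");
        // Copy into a TreeMap so the rows come out in a stable, sorted order.
        for (Map.Entry<String, Integer> e : new TreeMap<>(counts).entrySet()) {
            sb.append(escape(e.getKey())).append(',').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("NN", 2);
        counts.put("PRP$", 1);
        counts.put(",", 1);
        System.out.print(toCsv(counts));
    }
}
```

The resulting string can be written out with `java.nio.file.Files.write`. To process several text files, one option is to call `toCsv` once per file, or to merge the per-file maps first and write a single CSV.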

Stanford Parser - tag counts
