Question

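A node in a parse tree obtained from the Stanford Parser (for example, with englishPCFG.ser.gz) can have more than two children. How can I get a binarized parse tree with POS tag information on each node? Is there a parameter I can pass to the parser to achieve this?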

Answer 1

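The trees are not strictly binary branching because the Penn Treebank, which the parser is trained on, is not. This is a theoretical problem with the (now ancient) treebank that continues to plague computational linguists!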

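The way I deal with this is by writing complicated tree transformation logic that restructures the output of the constituency parser into binary branching structures with X-bar theoretic representations, in the process promoting functional projections over lexical phrases, raising quantifiers, and so on.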

[Update] I tried the TreeBinarizer class. It worked well on the one example I tried. I'm parsing Spanish and using Clojure. Here is a sample session:

user=> (import edu.stanford.nlp.parser.lexparser.TreeBinarizer)
edu.stanford.nlp.parser.lexparser.TreeBinarizer
user=> (import edu.stanford.nlp.trees.international.spanish.SpanishTreebankLanguagePack)
edu.stanford.nlp.trees.international.spanish.SpanishTreebankLanguagePack
user=> (import edu.stanford.nlp.trees.international.spanish.SpanishHeadFinder)
edu.stanford.nlp.trees.international.spanish.SpanishHeadFinder
user=> ; I have a parsed tree:

user=> (.pennPrint t)
(sp
  (prep (sp000 a))
  (S
    (infinitiu (vmn0000 decir))
    (S
      (conj (cs que))
      (grup.verb (vaip000 hemos) (vmp0000 visto))
      (sn
        (spec (di0000 un))
        (grup.nom (nc0s000 relámpago))))))
nil
user=> ; let's create a binarizer

user=> (def tb (TreeBinarizer/simpleTreeBinarizer (SpanishHeadFinder.) (SpanishTreebankLanguagePack.)))
#'user/tb
user=> ; now transform the tree above -- note that the second embedded S node has three children

user=> (.pennPrint (.transformTree tb t))
(sp
  (prep (sp000 a))
  (S
    (infinitiu (vmn0000 decir))
    (S
      (conj (cs que))
      (@S
        (grup.verb (vaip000 hemos) (vmp0000 visto))
        (sn
          (spec (di0000 un))
          (grup.nom (nc0s000 relámpago)))))))
nil
user=> ; the binarizer created an intermediate phrasal node @S, pushing the conjunction into <Spec, @S>
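
To make the @S node in the session above less mysterious, here is a minimal, language-neutral sketch (in Python) of simple right-factoring binarization. It is only an illustration and not the TreeBinarizer implementation: TreeBinarizer takes a head finder and has its own options, so its output may differ on other trees. The tuple-based tree representation and the names binarize and to_penn are hypothetical, made up for this example.

```python
def binarize(tree):
    """Right-factor an n-ary tree so that no node has more than two children.

    A tree is either a string (a word) or a tuple (label, child1, child2, ...).
    This is an illustrative assumption, not the Stanford Tree class.
    """
    if isinstance(tree, str):            # a word (leaf): nothing to do
        return tree
    label, *children = tree
    children = [binarize(c) for c in children]
    while len(children) > 2:
        # Fold the last two children into one intermediate "@label" node.
        children[-2:] = [("@" + label, children[-2], children[-1])]
    return (label, *children)


def to_penn(tree):
    """Render a tree as a one-line Penn-style bracketing."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "(" + label + " " + " ".join(to_penn(c) for c in children) + ")"


# The embedded S node with three children from the session above.
s = ("S",
     ("conj", ("cs", "que")),
     ("grup.verb", ("vaip000", "hemos"), ("vmp0000", "visto")),
     ("sn",
      ("spec", ("di0000", "un")),
      ("grup.nom", ("nc0s000", "relámpago"))))

print(to_penn(binarize(s)))
```

On this particular node the sketch yields the same shape as the session: conj stays directly under S, and grup.verb and sn are grouped under a new @S node. The POS tags (cs, vaip000, and so on) are untouched because they sit on unary preterminal nodes.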

How to get a binarized parse tree from the Stanford Parser?
