使用不带NLTK的Python解析词性标记的树语料库

时间:2019-02-10 03:30:31

标签: python regex parsing nlp text-mining

我的树语料库如下

display

我需要解析这棵树并转换为如下的句子形式

(TOP END_OF_TEXT_UNIT)

(TOP (S (NP (DT The)
            (NNP Fulton)
            (NNP County)
            (NNP Grand)
            (NNP Jury))
        (VP (VBD said)
            (NP (NNP Friday))
            (SBAR (-NONE- 0)
                  (S (NP (DT an)
                         (NN investigation)
                         (PP (IN of)
                             (NP (NP (NNP Atlanta))
                                 (POS 's)
                                 (JJ recent)
                                 (JJ primary)
                                 (NN election))))
                     (VP (VBD produced)
                         (NP (`` ``)
                             (DT no)
                             (NN evidence)
                             ('' '')
                             (SBAR (IN that)
                                   (S (NP (DT any)
                                          (NNS irregularities))
                                      (VP (VBD took)
                                          (NP (NN place)))))))))))
     (. .))

是否有任何算法可以解析上述内容,或者我们需要使用正则表达式来做到这一点,所以我不想使用NLTK软件包来做到这一点。

1 个答案:

答案 0 :(得分:1)

Pyparsing使嵌套表达式解析快速进行。

ToCharArray()

打印:

import pyparsing as pp

LPAR, RPAR = map(pp.Suppress, "()")
expr = pp.Forward()
label = pp.Word(pp.alphas.upper()+'-') | "''" | "``" | "."
word = pp.Literal(".") | "''" | "``" | pp.Word(pp.printables, excludeChars="()")

expr <<= LPAR + label + (word | pp.OneOrMore(expr)) + RPAR

sample = """
(TOP (S (NP (DT The)
            (NNP Fulton)
            (NNP County)
            (NNP Grand)
            (NNP Jury))
        (VP (VBD said)
            (NP (NNP Friday))
            (SBAR (-NONE- 0)
                  (S (NP (DT an)
                         (NN investigation)
                         (PP (IN of)
                             (NP (NP (NNP Atlanta))
                                 (POS 's)
                                 (JJ recent)
                                 (JJ primary)
                                 (NN election))))
                     (VP (VBD produced)
                         (NP (`` ``)
                             (DT no)
                             (NN evidence)
                             ('' '')
                             (SBAR (IN that)
                                   (S (NP (DT any)
                                          (NNS irregularities))
                                      (VP (VBD took)
                                          (NP (NN place)))))))))))
     (. .))
"""

result = pp.OneOrMore(expr).parseString(sample)
print(' '.join(result))

通常,这样的解析器将使用TOP S NP DT The NNP Fulton NNP County NNP Grand NNP Jury VP VBD said NP NNP Friday SBAR -NONE- 0 S NP DT an NN investigation PP IN of NP NP NNP Atlanta POS 's JJ recent JJ primary NN election VP VBD produced NP `` `` DT no NN evidence '' '' SBAR IN that S NP DT any NNS irregularities VP VBD took NP NN place . . 来保留嵌套元素的分组。但是在您的情况下,由于无论如何您最终还是想要一个平面列表,因此我们将其保留在外-pyparsing的默认行为是仅返回匹配字符串的平面列表。