Extracting parent and child nodes from a Python tree representation

Time: 2015-03-25 04:04:29

Tags: python tree extract nltk

[Tree('ROOT', [Tree('S', [Tree('INTJ', [Tree('UH', ['Hello'])]), Tree(',', [',']), Tree('NP', [Tree('PRP$', ['My']), Tree('NN', ['name'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['Melroy'])])]), Tree('.', ['.'])])]), Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('PRP$', ['your']), Tree('NN', ['name'])])]), Tree('.', ['?'])])])]

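I have many strings like this in Python; they are actually tree representations. For every word I want to extract its parent and child node labels, e.g. for 'Hello' I want (INTJ, UH), and for 'My' it would be (NP, PRP$).

This is the result I want: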

(INTJ, UH) , (NP, PRP$), (NP, NN) , (VP, VBZ) , (VP , VPZ) , (ADJP, JJ) , (WHNP, WP), (SQ, VBZ), (NP, PRP$), (NP, NN)

How can I do this?

1 answer:

Answer 0: (score: 2)

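Your string is evidently the representation of a list of Tree objects. It would be better if you had access to that list or could reconstruct it some other way. If not, the most straightforward way to create a data structure you can work with is eval() (with all the usual caveats about calling eval() on user-provided data).

Since you have not said anything about your Tree class, I'll write a simple one that is sufficient for the purposes of this question: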

class Tree:

    def __init__(self, name, branches):
        self.name = name
        self.branches = branches

Now we can recreate your data structure:

data = eval("""[Tree('ROOT', [Tree('S', [Tree('INTJ', [Tree('UH', ['Hello'])]), Tree(',', [',']), Tree('NP', [Tree('PRP$', ['My']), Tree('NN', ['name'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['Melroy'])])]), Tree('.', ['.'])])]), Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('PRP$', ['your']), Tree('NN', ['name'])])]), Tree('.', ['?'])])])]""")
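If the strings come from a source you do not fully trust, one way to sidestep the eval() caveats is to parse the text with the standard ast module and accept only the three constructs the representation needs: string literals, lists, and calls to Tree with two positional arguments. The following is a minimal sketch under those assumptions, not part of the original answer; parse_trees is a hypothetical helper name, the Tree class is the simple stand-in defined above (repeated so the block runs on its own), and Python 3.8 or later is assumed.

```python
import ast


class Tree:
    """Minimal stand-in for the Tree class used in this answer."""

    def __init__(self, name, branches):
        self.name = name
        self.branches = branches


def parse_trees(text):
    """Build Tree objects from a repr-style string without calling eval().

    Hypothetical helper: only string literals, lists and Tree(name, branches)
    calls are accepted; anything else raises ValueError.
    """

    def build(node):
        # A plain string literal, e.g. 'Hello' or 'NP'.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            return node.value
        # A list literal: convert each element recursively.
        if isinstance(node, ast.List):
            return [build(element) for element in node.elts]
        # A call of the exact form Tree(<name>, <branches>).
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "Tree"
            and len(node.args) == 2
            and not node.keywords
        ):
            name, branches = (build(arg) for arg in node.args)
            return Tree(name, branches)
        raise ValueError("unsupported syntax: " + ast.dump(node))

    # mode="eval" parses a single expression; .body is its root node.
    return build(ast.parse(text, mode="eval").body)


data = parse_trees("[Tree('ROOT', [Tree('S', [Tree('UH', ['Hello'])])])]")
print(data[0].name, data[0].branches[0].branches[0].branches)
```

The resulting list has the same shape as the one produced by eval(), so it can be used the same way, while input such as "__import__('os')" is rejected with a ValueError instead of being executed.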

Once we have that, we can write a function that generates the list of 2-tuples you want:

def tails(items, path=()):
    for item in items:
        if isinstance(item, Tree):
            if item.name in {".", ","}:  # ignore punctuation
                continue
            for result in tails(item.branches, path + (item.name,)):
                yield result
        else:
            yield path[-2:]

This function descends recursively into the tree, yielding the last two Tree names each time it hits a suitable leaf node.

Usage example:

>>> list(tails(data))
[('INTJ', 'UH'), ('NP', 'PRP$'), ('NP', 'NN'), ('VP', 'VBZ'), ('ADJP', 'JJ'), ('WHNP', 'WP'), ('SQ', 'VBZ'), ('NP', 'PRP$'), ('NP', 'NN')]
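Since the question asks for the pair "for every word", it may help to keep the word alongside its labels. The variant below is a sketch, not part of the original answer: word_tails and its skip parameter are hypothetical names, the Tree class is the same simple stand-in as above, and it uses Python 3's yield from in place of the explicit inner loop.

```python
class Tree:
    """Minimal stand-in for the Tree class used in this answer."""

    def __init__(self, name, branches):
        self.name = name
        self.branches = branches


def word_tails(items, path=(), skip=frozenset({".", ","})):
    """Yield (word, labels) for each leaf, where labels holds the last
    two Tree names on the path from the root down to that leaf."""
    for item in items:
        if isinstance(item, Tree):
            if item.name in skip:  # ignore punctuation subtrees
                continue
            yield from word_tails(item.branches, path + (item.name,), skip)
        else:
            # A leaf is a plain string: pair it with its two nearest labels.
            yield item, path[-2:]


# A shortened version of the first tree from the question.
sample = [
    Tree("ROOT", [
        Tree("S", [
            Tree("INTJ", [Tree("UH", ["Hello"])]),
            Tree(",", [","]),
            Tree("NP", [Tree("PRP$", ["My"]), Tree("NN", ["name"])]),
        ]),
    ]),
]

for word, labels in word_tails(sample):
    print(word, labels)
```

Passing the result to dict() would collapse repeated words such as 'name' into one entry, so a list of pairs is the safer container when the same word can occur more than once.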