I have this Stanford parse tree and I want to convert it to Newick format.
(ROOT
  (S
    (NP (DT A) (NN friend))
    (VP
      (VBZ comes)
      (NP
        (NP (JJ early))
        (, ,)
        (NP
          (NP (NNS others))
          (SBAR
            (WHADVP (WRB when))
            (S (NP (PRP they)) (VP (VBP have) (NP (NN time))))))))))
Answer 0 (score: 1)
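There may be ways to do this with string processing alone, but I would parse the tree and then print it recursively in Newick format. A fairly minimal implementation: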
import re


class Tree(object):
    def __init__(self, label):
        self.label = label
        self.children = []

    @staticmethod
    def _tokenize(string):
        # Tokens are '(', ')' or a run of non-space, non-paren characters.
        # The list is reversed so that pop() yields tokens in reading order.
        return list(reversed(re.findall(r'\(|\)|[^ \n\t()]+', string)))

    @classmethod
    def from_string(cls, string):
        tokens = cls._tokenize(string)
        return cls._tree(tokens)

    @classmethod
    def _tree(cls, tokens):
        t = tokens.pop()
        if t == '(':
            # The token after '(' is the node label; its subtrees follow.
            tree = cls(tokens.pop())
            for subtree in cls._trees(tokens):
                tree.children.append(subtree)
            return tree
        else:
            return cls(t)

    @classmethod
    def _trees(cls, tokens):
        # A plain `return` ends the generator. `raise StopIteration` inside a
        # generator becomes a RuntimeError on Python 3.7+ (PEP 479).
        while True:
            if not tokens:
                return
            if tokens[-1] == ')':
                tokens.pop()
                return
            yield cls._tree(tokens)

    def to_newick(self):
        if len(self.children) == 1:
            # Collapse a unary node into its only child.
            return self.children[0].to_newick()
        elif self.children:
            return '(' + ','.join(child.to_newick() for child in self.children) + ')'
        else:
            return self.label
Note, of course, that information is lost in the conversion, because only the terminal nodes are kept. Usage:
>>> s = """(ROOT (..."""
>>> Tree.from_string(s).to_newick()
...
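If losing the phrase labels is a problem, one option is to write them as internal node labels, which Newick allows after a closing parenthesis. The following is only a rough sketch of that idea and not part of the implementation above: `parse`, `quote` and `to_newick` are hypothetical standalone helpers, the quoting rule (single quotes around labels that contain Newick punctuation, with embedded quotes doubled) follows the common convention but may not suit every Newick reader, and the trailing `;` is added by the caller.

```python
import re


def parse(s):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r'\(|\)|[^\s()]+', s)
    pos = 0

    def node():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != '(':
            return (tok, [])      # a bare word is a leaf
        label = tokens[pos]       # the token after '(' is the node label
        pos += 1
        children = []
        while tokens[pos] != ')':
            children.append(node())
        pos += 1                  # consume the closing ')'
        return (label, children)

    return node()


def quote(label):
    """Single-quote labels that contain Newick punctuation (assumed convention)."""
    if re.search(r"[\s()\[\]':;,]", label):
        return "'" + label.replace("'", "''") + "'"
    return label


def to_newick(tree):
    """Emit Newick, keeping non-terminal labels as internal node labels."""
    label, children = tree
    if not children:
        return quote(label)
    return '(' + ','.join(to_newick(c) for c in children) + ')' + quote(label)


small = "(S (NP (DT A) (NN friend)) (VP (VBZ comes)))"
print(to_newick(parse(small)) + ';')
# prints (((A)DT,(friend)NN)NP,((comes)VBZ)VP)S;
```

Unlike the version above, this keeps unary nodes such as `(DT A)` instead of collapsing them, and a punctuation token like `(, ,)` comes out quoted rather than as a bare comma that would read as an extra separator.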