我有一个PENN-Syntax-Tree,我想以递归方式获取该树包含的所有规则。
(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP ((DT a) (NN talk))))
)
)
我的目标是获得如下的语法规则:
ROOT --> S
S --> NP VP
NP --> NN
...
正如我所说,我需要递归地执行而没有NLTK包或任何其他模块或正则表达式。这是我到目前为止所拥有的。参数tree
是在每个空格上分割的Penn-Tree。
def extract_rules(tree):
tree = tree[1:-1]
print("\n\n")
if len(tree) == 0:
return
root_node = tree[0]
print("Current Root: "+root_node)
remaining_tree = tree[1:]
right_side = []
temp_tree = list(remaining_tree)
print("remaining_tree: ", remaining_tree)
symbol = remaining_tree.pop(0)
print("Symbol: "+symbol)
if symbol not in ["(", ")"]:
print("CASE: No Brackets")
print("Rule: "+root_node+" --> "+str(symbol))
right_side.append(symbol)
elif symbol == "(":
print("CASE: Opening Bracket")
print("Temp Tree: ", temp_tree)
cursubtree_end = bracket_depth(temp_tree)
print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
cursubtree_start = temp_tree.index(symbol)
cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
print("Subtree: ", cursubtree)
rnode = extract_rules(cursubtree)
if rnode:
right_side.append(rnode)
print("Rule: "+root_node+" --> "+str(rnode))
print(right_side)
return root_node
def bracket_depth(tree):
counter = 0
position = 0
subtree = []
for i, char in enumerate(tree):
if char == "(":
counter = counter + 1
if char == ")":
counter = counter - 1
if counter == 0 and i != 0:
counter = i
position = i
break
subtree = tree[0:position+1]
return position
目前它适用于S
的第一个子树,但所有其他子树都不会被递归解析。很高兴得到任何帮助..
答案 0 :(得分:4)
我倾向于尽可能简单地保持它,而不是试图重新发明你目前不允许使用的解析模块。类似的东西:
string = '''
(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP (DT a) (NN talk)))
)
)
'''
def is_symbol_char(character):
'''
Predicate to test if a character is valid
for use in a symbol, extend as needed.
'''
return character.isalpha() or character in '-=$!?.'
def tokenize(characters):
'''
Process characters into a nested structure. The original string
'(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
'''
tokens = []
while characters:
character = characters.pop(0)
if character.isspace():
pass # nothing to do, ignore it
elif character == '(': # signals start of recursive analysis (push)
characters, result = tokenize(characters)
tokens.append(result)
elif character == ')': # signals end of recursive analysis (pop)
break
elif is_symbol_char(character):
# if it looks like a symbol, collect all
# subsequents symbol characters
symbol = ''
while is_symbol_char(character):
symbol += character
character = characters.pop(0)
# push unused non-symbol character back onto characters
characters.insert(0, character)
tokens.append(symbol)
# Return whatever tokens we collected and any characters left over
return characters, tokens
def extract_rules(tokens):
''' Recursively walk tokenized data extracting rules. '''
head, *tail = tokens
print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])
for token in tail: # recurse
if isinstance(token, list):
extract_rules(token)
characters, tokens = tokenize(list(string))
# After a successful tokenization, all the characters should be consumed
assert not characters, "Didn't consume all the input!"
print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')
extract_rules(tokens[0])
<强>输出强>
Tokens:
['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]
Rules:
ROOT --> S
S --> NP VP
NP --> NN DT NN
NN --> Carnac
DT --> the
NN --> Magnificent
VP --> VBD NP
VBD --> gave
NP --> DT NN
DT --> a
NN --> talk
注意强>
我将原始树改为此子句:
(NP ((DT a) (NN talk)))
似乎不正确,因为它在网络上提供的语法树grapher上生成一个空节点,因此我将其简化为:
(NP (DT a) (NN talk))
根据需要进行调整。
答案 1 :(得分:3)
这可以通过更简单的方式完成。鉴于我们知道我们的语法结构是CNF LR,我们可以使用递归正则表达式解析器来解析文本。
有一个名为pyparser的软件包(如果你还没有它,你可以用pip install pyparser
安装它。)
from pyparsing import nestedExpr
astring = '''(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP ((DT a) (NN talk))))
)
)'''
expr = nestedExpr('(', ')')
result = expr.parseString(astring).asList()[0]
print(result)
这给出了
['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', [['DT', 'a'], ['NN', 'talk']]]]]]
因此,我们已成功将字符串转换为列表层次结构。现在我们需要编写一些代码来解析列表并提取规则。
def get_rules(result, rules):
for l in result[1:]:
if isinstance(l, list) and not isinstance(l[0], list):
rules.add((result[0], l[0]))
get_rules(l, rules)
elif isinstance(l[0], list):
rules.add((result[0], tuple([x[0] for x in l])))
else:
rules.add((result[0], l))
return rules
正如我所提到的,我们已经知道语法的结构,所以我们只是在这里处理有限的条件。
这样调用此函数:
rules = get_rules(result, set()) # results was obtained from before
for i in rules:
print i
输出:
('ROOT', 'S')
('VP', 'NP')
('DT', 'the')
('NP', 'NN')
('NP', ('DT', 'NN'))
('NP', 'DT')
('S', 'VP')
('VBD', 'gave')
('NN', 'Carnac')
('NN', 'Magnificent')
('S', 'NP')
('VP', 'VBD')
根据需要订购。