I have a pattern like
(LEFT-WALL)(who)(is.v)(Obama)(,)(I.p)(love.v)(his)(speech.s)(RIGHT-WALL)
which I split to get each parenthesized item into a list. My regex works fine, except for nested text such as (Ob(am)a).
Example:
import re

post_script_word_str = '(LEFT-WALL)(who)(is.v)(Obama)(,)(I.p)(love.v)(his)(speech.s)(RIGHT-WALL)'
post_script_word_list = re.compile(r'\(([^\)\(]*)\)').split(post_script_word_str)
print(post_script_word_list)
post_script_link_str = '[0 12 4 (RW)][0 7 3 (Xx)][0 1 0 (Wd)][1 2 0 (Ss)][2 6 2 (Ost)][3 6 1 (Ds)][3 4 0 (La)][5 6 0 (AN)][7 8 0 (Wq)][8 9 0 (EAh)][9 10 0 (AF)][10 11 0 (SIs)]'
post_script_link_list = re.compile(r'\[([^\]\[]*)\]').split(post_script_link_str)
print(post_script_link_list)
Result:
['', 'LEFT-WALL', '', 'who', '', 'is.v', '', 'Obama', '', ',', '', 'I.p', '', 'love.v', '', 'his', '', 'speech.s', '', 'RIGHT-WALL', '']
['', '0 12 4 (RW)', '', '0 7 3 (Xx)', '', '0 1 0 (Wd)', '', '1 2 0 (Ss)', '', '2 6 2 (Ost)', '', '3 6 1 (Ds)', '', '3 4 0 (La)', '', '5 6 0 (AN)', '', '7 8 0 (Wq)', '', '8 9 0 (EAh)', '', '9 10 0 (AF)', '', '10 11 0 (SIs)', '']
But it fails for input such as (Ob(am)a) or [0 [1]2 4 (RW)]. I expected the same result as above, but it gives
['', 'LEFT-WALL', '', 'who', '', 'is.v', '(Ob', 'am', 'a)', ',', '', 'I.p', '', 'love.v', '', 'his', '', 'speech.s', '', 'RIGHT-WALL', '']
['[0 ', '1', '2 4 (RW)]', '0 7 3 (Xx)', '', '0 1 0 (Wd)', '', '1 2 0 (Ss)', '', '2 6 2 (Ost)', '', '3 6 1 (Ds)', '', '3 4 0 (La)', '', '5 6 0 (AN)', '', '7 8 0 (Wq)', '', '8 9 0 (EAh)', '', '9 10 0 (AF)', '', '10 11 0 (SIs)', '']
Any suggestions?
Updated input:
post_script_link_str = '[0 [1]2 4 (RW)][0 7 3 (Xx)][0 1 0 (Wd)][1 2 0 (Ss)][2 6 2 (Ost)][3 6 1 (Ds)][3 4 0 (La)][5 6 0 (AN)][7 8 0 (Wq)][8 9 0 (EAh)][9 10 0 (AF)][10 11 0 (SIs)]'
Result:
['[0 ', '1', '2 4 (RW)]', '0 7 3 (Xx)', '', '0 1 0 (Wd)', '', '1 2 0 (Ss)', '', '2 6 2 (Ost)', '', '3 6 1 (Ds)', '', '3 4 0 (La)', '', '5 6 0 (AN)', '', '7 8 0 (Wq)', '', '8 9 0 (EAh)', '', '9 10 0 (AF)', '', '10 11 0 (SIs)', '']
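Answer 0 (score: 2)
The re module cannot handle nested structures. You need the newer regex module (a third-party package), which supports recursion. By the way, I think the findall method is better suited to this job than split: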
import regex
regex.findall(r'\[([^][]*+(?:(?R)[^][]*)*+)]', post_script_link_str)
You get:
['0 [1]2 4 (RW)', '0 7 3 (Xx)', '0 1 0 (Wd)', '1 2 0 (Ss)', '2 6 2 (Ost)', '3 6 1 (Ds)', '3 4 0 (La)', '5 6 0 (AN)', '7 8 0 (Wq)', '8 9 0 (EAh)', '9 10 0 (AF)', '10 11 0 (SIs)']
Now you only need to map over the list to remove the remaining square brackets.
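As a minimal sketch of that last step, assuming "remove the brackets" means dropping every remaining [ and ] character from each item (the helper name strip_brackets is made up for this illustration):

```python
def strip_brackets(items):
    """Return a copy of items with every '[' and ']' character removed."""
    table = str.maketrans('', '', '[]')
    return [item.translate(table) for item in items]


matches = ['0 [1]2 4 (RW)', '0 7 3 (Xx)']
print(strip_brackets(matches))  # ['0 12 4 (RW)', '0 7 3 (Xx)']
```

For the parenthesized word list, the same idea should work with '()' as the set of characters to delete.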
Pattern details:
(?R) allows recursion, since it is an alias for the whole pattern.
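If installing the regex module is not an option, the same recursive idea can be sketched in plain Python. This is an illustrative stand-in rather than the method used in this answer, and the names match_group and find_groups are hypothetical. The recursive call plays the role that (?R) plays in the pattern:

```python
def match_group(text, pos, open_ch='[', close_ch=']'):
    """Match one balanced group starting at text[pos].

    Returns the index just past the closing delimiter, or -1 if no
    balanced group starts at pos.
    """
    if pos >= len(text) or text[pos] != open_ch:
        return -1
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == close_ch:
            return pos + 1  # closing delimiter of this group
        if ch == open_ch:
            # A nested group: recurse, much like (?R) re-enters the pattern.
            pos = match_group(text, pos, open_ch, close_ch)
            if pos == -1:
                return -1
        else:
            pos += 1
    return -1  # ran out of input, so the group is unbalanced


def find_groups(text, open_ch='[', close_ch=']'):
    """Collect the contents of every outermost balanced group in text."""
    groups = []
    pos = 0
    while pos < len(text):
        end = match_group(text, pos, open_ch, close_ch)
        if end == -1:
            pos += 1  # no balanced group starts here, try the next character
        else:
            groups.append(text[pos + 1:end - 1])  # drop the outer delimiters
            pos = end
    return groups


print(find_groups('[0 [1]2 4 (RW)][0 7 3 (Xx)]'))
# ['0 [1]2 4 (RW)', '0 7 3 (Xx)']
print(find_groups('(is.v)(Ob(am)a)(,)', '(', ')'))
# ['is.v', 'Ob(am)a', ',']
```

Like the findall call, this keeps inner delimiters inside each item, so they still have to be stripped afterwards if they are unwanted.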
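*+ is a possessive quantifier. It is the same as *, except that it does not allow the regex engine to backtrack. It is used here to prevent catastrophic backtracking if the brackets happen to be unbalanced.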
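The no-backtracking behaviour can be observed with the standard re module alone. The sketch below emulates a possessive a*+ with the lookahead-plus-backreference idiom (?=(a*))\1, which is a substitution made for this illustration and not part of the pattern above. Python 3.11 and later also accept *+ directly in re.

```python
import re

# Greedy: a* first takes all three 'a's, then gives one back so the
# trailing 'a' in the pattern can still match.
greedy = re.match(r'a*a', 'aaa')

# Emulated possessive: the lookahead captures the longest run of 'a's and
# is never re-entered, so \1 consumes all of them and gives nothing back.
possessive = re.match(r'(?=(a*))\1a', 'aaa')

print(greedy.group(0) if greedy else None)  # aaa
print(possessive)                           # None
```

The second match fails precisely because nothing is handed back, and that refusal to retry shorter matches is what keeps the bracket pattern from exploring a huge number of alternatives on unbalanced input.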