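I want to capture everything inside the tag plus the lines that follow it, but the match is supposed to stop at the next bracketed tag. What am I doing wrong?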
import re #regex
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
(\b(?:.|\s)*(?!\[)) # should read: anyword that doesn't precede a bracket
""", re.MULTILINE | re.VERBOSE)
haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
@[this should be taken though as this is in the content]
[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print(m)
What I want is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n', '[tab2]', 'help me\nwrite a better RE\n')]
Edit:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
This seems to work, but it also cuts the content off at any bracket that appears inside it.
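To make the truncation concrete, here is a small sketch that runs the edited pattern against the sample text from the question. The name `matches` is just an illustrative variable; the pattern and haystack are the ones shown above.

```python
import re

# Same pattern as in the edit above: the body is "anything that is not '['".
regex = re.compile(r"""
    ^              # start of a line
    \[(.*?)\]      # tag name in brackets
    \n             # the tag must be followed by a newline
    ([^\[]*)       # body: stops at the first '[' of any kind
""", re.MULTILINE | re.VERBOSE)

haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
@[this should be taken though as this is in the content]
[tab2]
help me
write a better RE
"""

matches = regex.findall(haystack)
# The body of tab1 ends right after the '@', because the character class
# cannot tell a '[' inside the content from the '[' that opens the next tag.
print(matches)
```

The `[^\[]*` class works one character at a time, so it has no way to ask "is this bracket the start of a tag line?", which is why the rest of the `@[...]` line is lost.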
Answer 0 (score: 3)
Python regexes don't support recursion, afaik.
EDIT: but in your case this would work:
regex = re.compile(r"""
^ # Must start in a newline first
\[(.*?)\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^\[]*) # stop reading at opening bracket
""", re.MULTILINE | re.VERBOSE)
EDIT2: yes, that doesn't work properly. Try this instead:
import re
regex = re.compile(r"""
(?:^|\n)\[ # tag's opening bracket
([^\]\n]*) # 1. text between brackets
\]\n # tag's closing bracket
(.*?) # 2. text between the tags
(?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
""", re.DOTALL | re.VERBOSE)
haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag
[tag2]
help me
write a better RE[[[]
"""
print(regex.findall(haystack))
I do agree with viraptor. Regexes are cool, but they can't check your file for errors. Maybe a hybrid? :P
tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))
result = {}
# each tag's content runs up to the start of the next tag
for (mo1, mo2) in zip(tags[:-1], tags[1:]):
    result[mo1.group(1)] = haystack[mo1.end(1)+1:mo2.start(1)-1].strip()
# the last tag's content runs to the end of the string
if tags:
    last = tags[-1]
    result[last.group(1)] = haystack[last.end(1)+1:].strip()
print(result)
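Since the point of the hybrid is to get some error checking back, here is a minimal sketch of the same idea wrapped in a function. `split_sections` and `TAG_RE` are hypothetical names, and the two checks (no tags at all, stray text before the first tag) are only examples of what could be validated.

```python
import re

# A tag is a line that consists of nothing but [name].
TAG_RE = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)

def split_sections(text):
    """Hypothetical helper: map each tag name to the stripped text after it."""
    tags = list(TAG_RE.finditer(text))
    if not tags:
        raise ValueError("no [tag] lines found")
    if text[:tags[0].start()].strip():
        raise ValueError("content before the first tag")
    # Each section ends where the next tag starts; the last one ends at EOF.
    ends = [mo.start() for mo in tags[1:]] + [len(text)]
    result = {}
    for mo, end in zip(tags, ends):
        result[mo.group(1)] = text[mo.end():end].strip()
    return result

haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag
[tag2]
help me
write a better RE[[[]
"""
print(split_sections(haystack))
```

The regex only locates the tag lines; the slicing and the checks are plain Python, so they can be stepped through in a debugger.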
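EDIT3: that's because the ^ character means negation only inside a character class such as [^squarebrackets]. Anywhere else it means start of string (or start of line with re.MULTILINE). There's no good way to do negative string matching in a regex, only negative character matching.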
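When a whole string rather than a single character has to be excluded, one commonly used approximation is a negative lookahead, `(?!...)`, which Python's `re` module supports. The sketch below is not the pattern from this answer; `section_re` and `sections` are illustrative names. It consumes the body one line at a time and stops as soon as the next line looks like a tag.

```python
import re

section_re = re.compile(r"""
    ^\[([^\]\n]*)\]\n          # 1. a tag line: [name] alone on its line
    (                          # 2. the body:
      (?:
        (?!\[[^\]\n]*\]$)      #    next line must not itself be a tag line
        (?:.+\n?|\n)           #    consume one non-empty or empty line
      )*
    )
""", re.MULTILINE | re.VERBOSE)

haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
@[this should be taken though as this is in the content]
[tab2]
help me
write a better RE
"""

sections = section_re.findall(haystack)
print(sections)
```

A content line that consists of nothing but `[something]` would still be taken for a tag; the format itself is ambiguous there, and no pattern can fix that.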
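Answer 1 (score: 3)
First of all, why use a regex if what you are trying to do is parse? As you can see, you couldn't find the source of the problem yourself, because a regex gives no feedback. Also, there isn't any recursion in that RE.
Make your life simple: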
def ini_parse(src):
    in_block = None  # name of the section currently being read
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            # a line that is entirely "[name]" opens a new section
            in_block = line[1:-1]
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            raise Exception("content out of block")
    return contents
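You get error handling through exceptions, and the ability to step through the execution in a debugger as a bonus. You also get a dictionary as the result, and you can deal with duplicate sections while processing. My result: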
{'tab2': 'help me\nwrite a better RE\n\n',
'tab1': 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n\n'}
REs are much overused these days...
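As an illustration of dealing with duplicate sections while processing, here is one possible variant that rejects them outright. `ini_parse_strict` is a hypothetical name, and raising `ValueError` is just one policy; merging or overwriting would be equally easy to write.

```python
def ini_parse_strict(src):
    """Like ini_parse above, but refuses a section name that appears twice."""
    in_block = None  # name of the section currently being read
    contents = {}
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            in_block = line[1:-1]
            if in_block in contents:
                # assumption: a repeated section is treated as an error
                raise ValueError("duplicate section: %r" % in_block)
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            raise ValueError("content out of block: %r" % line)
    return contents

try:
    ini_parse_strict("[a]\nx\n[a]\ny\n")
except ValueError as err:
    print(err)
```

The error message names the offending section or line, which is exactly the kind of feedback a single regex cannot give.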
Answer 2 (score: 2)
Does this do what you want?
regex = re.compile(r"""
^ # Must start in a newline first
\[\b(.*)\b\] # Get what's enclosed in brackets
\n # only capture bracket if a newline is next
([^[]*)
""", re.MULTILINE | re.VERBOSE)
This gives a list of tuples (one 2-tuple per match). If you want a single flat tuple, you can write:
m = sum(regex.findall(haystack), ())
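For comparison, a short sketch of the same flattening done with `itertools.chain.from_iterable`. The `pairs` list is a made-up stand-in for what `regex.findall(haystack)` returns, so the values are placeholders.

```python
from itertools import chain

# Stand-in for regex.findall(haystack): one (tag, body) tuple per match.
pairs = [('tab1', 'body one\n'), ('tab2', 'body two\n')]

flat_sum = sum(pairs, ())                       # repeated tuple concatenation
flat_chain = tuple(chain.from_iterable(pairs))  # single pass over all items

print(flat_sum)
print(flat_chain)
```

Both give the same tuple. `sum` builds a new tuple at every step, so for a long list of matches the `chain` version is likely the cheaper of the two.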