Question

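I want to capture everything inside the tag plus the lines that follow it, but it is supposed to stop the next time it meets a bracket. What am I doing wrong?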

import re #regex

regex = re.compile(r"""
         ^                    # Must start in a newline first
         \[\b(.*)\b\]         # Get what's enclosed in brackets 
         \n                   # only capture bracket if a newline is next
         (\b(?:.|\s)*(?!\[))  # should read: anyword that doesn't precede a bracket
       """, re.MULTILINE | re.VERBOSE)

haystack = """
[tab1]
this is captured
but this is suppose to be captured too!
@[this should be taken though as this is in the content]

[tab2]
help me
write a better RE
"""
m = regex.findall(haystack)
print(m)

What I want is:
[('tab1', 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n', '[tab2]', 'help me\nwrite a better RE\n')]

Edit:

regex = re.compile(r"""
             ^           # Must start in a newline first
             \[(.*?)\]   # Get what's enclosed in brackets 
             \n          # only capture bracket if a newline is next
             ([^\[]*)    # stop reading at opening bracket
        """, re.MULTILINE | re.VERBOSE)

This seems to work, but it also cuts the content off at any bracket that appears inside it.

Answer 1

Python regular expressions don't support recursion, AFAIK.

Edit: but in your case this would work:

regex = re.compile(r"""
         ^           # Must start in a newline first
         \[(.*?)\]   # Get what's enclosed in brackets 
         \n          # only capture bracket if a newline is next
         ([^\[]*)    # stop reading at opening bracket
    """, re.MULTILINE | re.VERBOSE)

Edit 2: yes, you're right, that doesn't work properly. Try this instead:

import re

regex = re.compile(r"""
    (?:^|\n)\[             # tag's opening bracket  
        ([^\]\n]*)         # 1. text between brackets
    \]\n                   # tag's closing bracket
    (.*?)                  # 2. text between the tags
    (?=\n\[[^\]\n]*\]\n|$) # until tag or end of string but don't consume it
    """, re.DOTALL | re.VERBOSE)

haystack = """[tag1]
this is captured [not a tag[
but this is suppose to be captured too!
[another non-tag

[tag2]
help me
write a better RE[[[]
"""

print(regex.findall(haystack))

I do agree with viraptor, though. Regexes are cool, but they can't check your file for errors. Perhaps a hybrid? :P

tag_re = re.compile(r'^\[([^\]\n]*)\]$', re.MULTILINE)
tags = list(tag_re.finditer(haystack))

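result = {}
for i, mo in enumerate(tags):
    # A section's content runs from the end of its tag line to the start
    # of the next tag line, or to the end of the string for the last tag.
    end = tags[i + 1].start() if i + 1 < len(tags) else len(haystack)
    result[mo.group(1)] = haystack[mo.end():end].strip()

print(result)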

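Edit 3: that's because the ^ character means negation only inside a character class such as [^squarebrackets]. Anywhere else it means start of string (or start of line with re.MULTILINE). Regex has no good way to negate a whole string, only single characters.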
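The two meanings of ^ described above can be seen in a short sketch. The sample strings here are made up for illustration and are not the asker's data.

```python
import re

text = "[tab1]\nfoo [x] bar\n[tab2]\n"

# Outside a character class, ^ is an anchor. Without re.MULTILINE it
# matches only at the very start of the string.
starts_default = re.findall(r"^\[(\w+)\]", text)

# With re.MULTILINE it also matches right after every newline, so a
# bracket in the middle of a line ("[x]") is still not treated as a tag.
starts_multiline = re.findall(r"^\[(\w+)\]", text, re.MULTILINE)

# Inside a character class, a leading ^ negates the set: [^\[]* matches a
# run of characters that are not "[", so it stops at the first bracket.
run = re.match(r"[^\[]*", "foo [x] bar").group(0)

print(starts_default)    # ['tab1']
print(starts_multiline)  # ['tab1', 'tab2']
print(repr(run))         # 'foo '
```

The last line is why a pattern like ([^\[]*) truncates the content as soon as any opening bracket appears in it.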

Answer 2

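First of all, why use a regex if what you want is to parse? As you can see, you can't find the source of the problem yourself, because a regex gives you no feedback. Also, there isn't any recursion in that RE.

Make your life simple: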

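def ini_parse(src):
    in_block = None   # name of the section currently being read
    contents = {}     # section name -> accumulated text
    for line in src.split("\n"):
        if line.startswith('[') and line.endswith(']'):
            # A header line such as [tab1] opens a new section.
            in_block = line[1:-1]
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            # Non-blank text before the first header is an error.
            raise Exception("content out of block")
    return contents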

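You get error handling through exceptions, and the ability to step through execution in a debugger as a bonus. You also get a dictionary as the result, and you can deal with duplicate sections while processing. My result: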

{'tab2': 'help me\nwrite a better RE\n\n',
 'tab1': 'this is captured\nbut this is suppose to be captured too!\n@[this should be taken though as this is in the content]\n\n'}

REs are overused these days...
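As one possible way to deal with duplicate sections, here is a sketch of a stricter variant of the parser above. The name ini_parse_strict, the choice of ValueError, and the line numbers in the messages are assumptions added for illustration; they are not part of the original answer.

```python
def ini_parse_strict(src):
    """Hypothetical variant of ini_parse that rejects repeated section names."""
    in_block = None
    contents = {}
    for lineno, line in enumerate(src.split("\n"), start=1):
        if line.startswith("[") and line.endswith("]"):
            in_block = line[1:-1]
            if in_block in contents:
                # The original parser would silently reset the section here.
                raise ValueError("line %d: duplicate section %r" % (lineno, in_block))
            contents[in_block] = ""
        elif in_block is not None:
            contents[in_block] += line + "\n"
        elif line.strip() != "":
            raise ValueError("line %d: content outside of any section" % lineno)
    return contents


# Brackets inside the content are kept, because only whole lines of the
# form [name] count as headers.
parsed = ini_parse_strict("[a]\nx\n[b]\ny [not a tag\n")
print(parsed)

try:
    ini_parse_strict("[a]\nx\n[a]\ny\n")
except ValueError as exc:
    print(exc)  # line 3: duplicate section 'a'
```

Whether a repeated header should raise, overwrite, or append is a design decision; the point is that a plain loop makes each of those a one-line change.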

Answer 3

Does this do what you want?

regex = re.compile(r"""
         ^                      # Must start in a newline first
         \[\b(.*)\b\]           # Get what's enclosed in brackets 
         \n                     # only capture bracket if a newline is next
         ([^[]*)
       """, re.MULTILINE | re.VERBOSE)

This gives a list of tuples (one 2-tuple per match). If you want a flat tuple, you can write:

m = sum(regex.findall(haystack), ())
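For comparison, a small sketch of the same flattening done with itertools.chain.from_iterable, which avoids building a new tuple for every addition. The matches list is a made-up stand-in for what regex.findall(haystack) would return, not actual output.

```python
from itertools import chain

# Stand-in for regex.findall(haystack): one (tag, body) 2-tuple per match.
matches = [("tab1", "first body\n"), ("tab2", "second body\n")]

# The approach from the answer: repeated tuple concatenation.
flat_sum = sum(matches, ())

# Alternative: chain the tuples lazily, then build one tuple at the end.
flat_chain = tuple(chain.from_iterable(matches))

print(flat_chain)
```

Both give the same flat tuple; for a handful of matches the difference is only a matter of taste.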
