Question

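I'm trying to write a simple Markdown-to-LaTeX converter, just to learn Python and basic regular expressions, but I'm stuck trying to figure out why the code below doesn't work: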

re.sub(r'\[\*\](.*?)\[\*\]: ?(.*?)$', r'\\footnote{\2}\1', s, flags=re.MULTILINE|re.DOTALL)

I want to convert something like:

s = """This is a note[*] and this is another[*]
[*]: some text
[*]: other text"""

to:

This is a note\footnote{some text} and this is another\footnote{other text}

This is what I get (using the regex above):

This is a note\footnote{some text} and this is another[*]

[*]: other text

Why does the pattern match only once?

EDIT:

I tried the following lookahead assertion:

re.sub(r'\[\*\](?!:)(?=.+?\[\*\]: ?(.+?)$)', r'\\footnote{\1}', s, flags=re.DOTALL|re.MULTILINE)
# (?!:) prevents [*]: from being matched

Now it matches all the footnote markers, but it doesn't pair them with the right notes.

s = """This is a note[*] and this is another[*]
[*]: some text
[*]: other text"""

gives me

This is a note\footnote{some text} and this is another\footnote{some text}
[*]: some text
[*]: other text

Any ideas?

Answer 1

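The reason is that you can't match the same characters more than once. Once a character has been matched, it is consumed by the regex engine and can't be reused for another match.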
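To see this with the pattern from the question, here is a small sketch that lists every match and its span. The names `pattern` and `matches` are only for illustration. The single match starts at the first placeholder and runs through the first note definition, so the second placeholder is already inside it.

```python
import re

s = """This is a note[*] and this is another[*]
[*]: some text
[*]: other text"""

# Same pattern and flags as in the question
pattern = re.compile(r'\[\*\](.*?)\[\*\]: ?(.*?)$',
                     flags=re.MULTILINE | re.DOTALL)

matches = list(pattern.finditer(s))
for m in matches:
    # span() shows which characters this match consumed
    print(m.span(), repr(m.group(0)))

# Position of the second placeholder in the original string
second_placeholder = s.index('[*]', s.index('[*]') + 1)
print(second_placeholder)
```

Because the second `[*]` falls inside the span of the first match, the engine resumes searching after it and never gets a chance to treat it as the start of a new match.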

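The (general) workaround consists of capturing the overlapping part with a capture group inside a lookahead assertion. But in your case that can't be done, because there's no way to tell which note is associated with which placeholder.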
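As a minimal illustration of that general workaround (on a toy string, not on the footnote problem), a lookahead consumes nothing, so a capture group placed inside it can report overlapping matches:

```python
import re

text = "aaaa"

# Ordinary matching: each match consumes its characters
plain = re.findall(r'aa', text)

# Lookahead matching: the match itself is empty, the group holds the text
overlapping = re.findall(r'(?=(aa))', text)

print(plain)
print(overlapping)
```

The first call finds two matches and the second finds three, since the lookahead version can start a new match at every position.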

An easier way is to first extract all the notes into a list, and then replace each placeholder using a callback. For example:

import re

s = '''This is a note[*] and this is another[*]
[*]: note 1
[*]: note 2'''

# split the body text from the trailing block of note definitions
text, notes = re.split(r'((?:\r?\n\[\*\]:[^\r\n]*)+$)', s)[:-1]

# this generator yields the next replacement string
def getnote(notes):
    for note in re.split(r'\r?\n\[\*\]: ', notes)[1:]:
        yield r'\footnote{{{}}}'.format(note)

note = getnote(notes)

# each placeholder takes the next note, in order (Python 3: next(note), print())
res = re.sub(r'\[\*\]', lambda m: next(note), text)
print(res)
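The same idea can be wrapped in a function. The sketch below is one possible variant, not the answerer's code: the name `convert_footnotes` is made up for illustration, it assumes the note definitions are the last lines of the input, and it leaves a placeholder untouched if there are more placeholders than notes.

```python
import re

def convert_footnotes(source):
    """Replace each [*] with \\footnote{...}, pairing them in order."""
    lines = source.splitlines()
    notes = []
    # Peel definition lines ("[*]: text") off the end of the input
    while lines and lines[-1].startswith('[*]:'):
        notes.append(lines.pop()[len('[*]:'):].strip())
    notes.reverse()  # they were collected last-to-first

    remaining = iter(notes)

    def replace(match):
        note = next(remaining, None)
        if note is None:
            # More placeholders than notes: keep the marker as-is
            return match.group(0)
        return '\\footnote{' + note + '}'

    # A function replacement is inserted literally, so backslashes are safe
    return re.sub(r'\[\*\]', replace, '\n'.join(lines))

s = """This is a note[*] and this is another[*]
[*]: some text
[*]: other text"""

result = convert_footnotes(s)
print(result)
```

Using a callback rather than a replacement template also avoids having to escape backslashes that might appear in the note text.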

Answer 2

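The problem is that once your regex has consumed a part of the string, it won't reconsider that part for another match. So after the first replacement, it doesn't go back to match the second [*], because that has already been consumed.

What you need here is a loop that keeps doing the replacement until no match is left. Something like this: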

>>> import re
>>> text = 'This is a note[*] and this is another[*]\n\
... [*]: note 1\n\
... [*]: note 2'
>>> reg = r'(.*?)\[\*\](.*?)\[\*\]: (note \d)(.*)'
>>> 
>>> while re.search(reg, text, flags=re.MULTILINE|re.DOTALL):
...     text = re.sub(reg, r'\1\\footnote{\3}\2\4', text, flags=re.MULTILINE|re.DOTALL)
... 
>>> 
>>> text
'This is a note\\footnote{note 1} and this is another\\footnote{note 2}\n\n'

You can tweak the regex a bit to get rid of the trailing newlines in the resulting string. Oh, and you can also precompile the regex with re.compile.
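Both suggestions might look like the sketch below. This is one possible tweak, not the answerer's exact regex: it consumes the newline in front of each definition so nothing is left dangling, it accepts any note text up to the end of the line instead of only `note \d`, and it uses `subn` with `count=1` so the loop stops as soon as a pass makes no replacement.

```python
import re

text = 'This is a note[*] and this is another[*]\n[*]: note 1\n[*]: note 2'

# Compiled once; the \n before the definition is matched but not re-emitted
pattern = re.compile(r'(.*?)\[\*\](.*?)\n\[\*\]: ?([^\n]*)(.*)',
                     flags=re.DOTALL)

while True:
    # Replace one placeholder/definition pair per pass
    text, count = pattern.subn(r'\1\\footnote{\3}\2\4', text, count=1)
    if count == 0:
        break

print(repr(text))
```

Each pass removes one placeholder and one definition, so the loop runs once per footnote plus a final pass that finds nothing.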

Python regex only matches once
