Question

I'm building a regular expression that converts *asterisks* into <b>bold tags</b>, as a simple version of Markdown. The regex looks like this:

import re

markdown = '\*(?P<name>.+)\*'
bold = '<b>\g<name></b>'
text = 'abcdef *bold* ghijkl'
print(re.sub(markdown, bold, text))

>>> abcdef <b>bold</b> ghijkl

Now I need to ignore escaped asterisks \*, and here I run into two problems:

Problem 1

When I try to specify the escape character as \\:

markdown = '[^\\]\*(?P<name>.+)[^\\]\*'

I get a Python error:

sre_constants.error: unexpected end of regular expression

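So there is a syntax error somewhere, and I can't seem to fix it.

Problem 2

Suppose I want to ignore asterisks preceded by the character A (not a backslash). My regex works here: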

markdown = '[^A]\*(?P<name>.+)[^A]\*'
bold = '<b>\g<name></b>'
text = 'abcdef A*bold* ghijkl'
print(re.sub(markdown, bold, text))

>>> abcdef A*bold* ghijkl

But if there is no preceding A in my line, the regex consumes some valuable characters from the text:

text = 'abcdef *bold* ghijkl'
print(re.sub(markdown, bold, text))

>>> abcdef<b>bol</b> ghijkl

Note that the first space and the letter d have disappeared.

How do I deal with these two problems?

Answer 1

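Problem 1a: syntax error

It doesn't work because the backslash itself has to be escaped as well. In an ordinary string literal '\\' is a single backslash, so the regex engine receives [^\] and reads \] as an escaped bracket, which leaves the character class unterminated.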

markdown = '[^\\\\]\*(?P<name>.+)[^\\\\]\*'

Or use r'' to define a raw string.

markdown = r'[^\\]\*(?P<name>.+)[^\\]\*'
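
To see why the original pattern fails, it can help to print the string that the regex engine actually receives. The following is a minimal sketch, assuming Python 3; the names broken, fixed and compiled are only illustrative, and the broken pattern is spelled with doubled backslashes so that it has the same string value as the one in the question without triggering escape warnings. On recent Python 3 versions the error is reported as an unterminated character set rather than "unexpected end of regular expression".

```python
import re

# Same string value as the question's pattern '[^\\]\*(?P<name>.+)[^\\]\*'.
broken = '[^\\]\\*(?P<name>.+)[^\\]\\*'
print(broken)  # [^\]\*(?P<name>.+)[^\]\*

try:
    re.compile(broken)
    compiled = True
except re.error as exc:
    # "\]" is an escaped bracket, so the character class never closes.
    compiled = False
    print('re.error:', exc)

# The raw string keeps both backslashes, so the class is [^\\] as intended.
fixed = r'[^\\]\*(?P<name>.+)[^\\]\*'
re.compile(fixed)
```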

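Problem 1b: solution

My suggestion: don't try to solve it with a single regex.

1. Replace the problematic escape sequences with custom placeholders.
2. Run the normal regex.
3. Undo the custom escaping.

Code: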

import re

# placeholder -> original escape sequence
my_escapes = {
    '%backslash-escaped%': r'\\',
    '%bold-escaped%': r'\*',
}

text = r'text \*not-bold text2 *bold* text3 \\*bold* text4 \\\*not-bold text5 \\\\*bold* text6'
text = re.sub(r'\\\\', '%backslash-escaped%', text)  # hide escaped backslashes (\\)
text = re.sub(r'\\\*', '%bold-escaped%', text)  # hide escaped asterisks (\*)
text = re.sub(r'(?<!\\)\*(?P<bold>[^*\\]+)\*', r'<b>\g<bold></b>', text)  # add bold parts

# undo all escapes
for key, value in my_escapes.items():
    text = text.replace(key, value)

print(text)

>>> text \*not-bold text2 <b>bold</b> text3 \\<b>bold</b> text4 \\\*not-bold text5 \\\\<b>bold</b> text6
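
The same three steps can be wrapped in a function so that individual cases are easy to check. This is only a sketch, assuming Python 3: render_bold and ESCAPES are hypothetical names, and the placeholder tokens are assumed never to occur in real input (text that already contains '%bold-escaped%' would be corrupted). The literal replacements use str.replace instead of re.sub, which behaves the same here because no regex features are needed.

```python
import re

# Hypothetical placeholder tokens, assumed to be absent from real input.
ESCAPES = {
    '%backslash-escaped%': r'\\',
    '%bold-escaped%': r'\*',
}

def render_bold(text):
    # 1. Hide escaped backslashes first, then escaped asterisks (order matters).
    text = text.replace(r'\\', '%backslash-escaped%')
    text = text.replace(r'\*', '%bold-escaped%')
    # 2. Any asterisk that is left is a real delimiter.
    text = re.sub(r'(?<!\\)\*(?P<bold>[^*\\]+)\*', r'<b>\g<bold></b>', text)
    # 3. Restore the original escape sequences.
    for placeholder, original in ESCAPES.items():
        text = text.replace(placeholder, original)
    return text

print(render_bold(r'plain *bold* and \*not bold\*'))
```

Escaped backslashes have to be hidden before escaped asterisks; otherwise the second backslash of \\* would be mistaken for the start of \*.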

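Problem 2: disappearing characters

They disappear because you match them but never put them back. To fix that, capture them in groups (named groups here) and insert those groups into the replacement string.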

markdown = r'(?P<first_char>[^A])\*(?P<name>.+)(?P<sec_char>[^A])\*'
bold = r'\g<first_char><b>\g<name>\g<sec_char></b>'  # sec_char is the last bold character, so it stays inside the tags
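
As a quick check of the group-based approach, the sketch below runs it on both sample lines from the question. It assumes Python 3 and places the second captured character inside the tags, since that character is the last letter of the bold text; the names with_space and with_a are only illustrative.

```python
import re

# Capture the neighbouring characters so they can be written back.
markdown = r'(?P<first_char>[^A])\*(?P<name>.+)(?P<sec_char>[^A])\*'
# sec_char is the last character of the bold text, so it goes inside <b>...</b>.
bold = r'\g<first_char><b>\g<name>\g<sec_char></b>'

with_space = re.sub(markdown, bold, 'abcdef *bold* ghijkl')
with_a = re.sub(markdown, bold, 'abcdef A*bold* ghijkl')

print(with_space)  # abcdef <b>bold</b> ghijkl
print(with_a)      # abcdef A*bold* ghijkl
```

Because [^A] must consume a real character on each side, this variant cannot match an asterisk at the very start of the string or a single-character bold run such as *b*.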

Or use lookarounds, which check the neighbouring character without consuming it.

markdown = r'(?<!A)\*(?P<name>.+)(?<!A)\*'  # negative lookbehind in front of both asterisks
bold = r'<b>\g<name></b>'
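
Applied to the backslash case from the question, the lookaround idea could look like the sketch below. It assumes Python 3, puts a negative lookbehind (?<!\\) in front of each asterisk, and uses a non-greedy .+? (an assumption, not part of the pattern above) so that several bold runs on one line are matched separately. The result names plain, escaped and mixed are only illustrative.

```python
import re

# An asterisk counts as a delimiter only if no backslash precedes it.
markdown = r'(?<!\\)\*(?P<name>.+?)(?<!\\)\*'
bold = r'<b>\g<name></b>'

plain = re.sub(markdown, bold, r'abcdef *bold* ghijkl')
escaped = re.sub(markdown, bold, r'abcdef \*not bold\* ghijkl')
mixed = re.sub(markdown, bold, r'*a* and *b \* c*')

print(plain)    # abcdef <b>bold</b> ghijkl
print(escaped)  # abcdef \*not bold\* ghijkl
print(mixed)    # <b>a</b> and <b>b \* c</b>
```

A single lookbehind cannot tell \* apart from \\* (an escaped backslash followed by a real asterisk), which is the case the placeholder approach handles.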
