Question

Note: I am NOT using a regex to parse large amounts of HTML or generic HTML. I know that's bad.

TL;DR:

I have strings like

A sentence with an exclamation\! Next is a \* character

where some characters are "escaped" in the original markup. I want to replace each of them with its "original", to get:

A sentence with an exclamation! Next is a * character

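I need to extract a few small pieces of data from some wiki markup.

I'm only dealing with paragraphs/fragments here, so I don't need a robust solution. In Python, I tried a test: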

s = "test \\* \\! test * !! **"

r = re.compile("""\\.""") # Slash followed by anything

r.sub("-", s)

This should give:

test - - test * !! **

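But it doesn't do anything. Am I missing something here?

Also, I'm not sure how to replace any given escaped character with its original, so I would probably just make a list and substitute each one with a specific regex, such as: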

\\\*

and

\\!

There may be a cleaner way to do this, so any help is greatly appreciated.

Answer 1

You are missing something, namely the r prefix:

r = re.compile(r"\\.") # Slash followed by anything

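Both Python and re attach meaning to \. By the time you pass the string value to re.compile(), Python has already turned your doubled backslash into a single backslash, so re sees \., which means a literal period.

>>> print("""\\.""")
\.

By using r'', you tell Python not to interpret escape codes, so re is now given a string containing \\., which means a literal backslash followed by any character:

>>> print(r"""\\.""")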
\\.
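To make the difference concrete, here is a minimal sketch that compares the two literals and what each one matches. The sample string (with an added period) is an assumption made for illustration, not part of the question.

```python
import re

# Sample text containing single backslashes and one literal period (illustrative).
s = "test \\* \\! test. * !! **"

plain = "\\."   # Python collapses this to 2 characters: backslash, period
raw = r"\\."    # the raw literal keeps all 3 characters: backslash, backslash, period

print(len(plain), len(raw))

# As a regex, \. matches a literal period only, so the escapes are untouched.
print(re.sub(plain, "-", s))

# As a regex, \\. matches a backslash followed by any character.
print(re.sub(raw, "-", s))
```

Only the second substitution touches the backslash escapes; the first one replaces just the period.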

Demo:

>>> import re
>>> s = "test \\* \\! test * !! **"
>>> r = re.compile(r"\\.") # Slash followed by anything
>>> r.sub("-", s)
'test - - test * !! **'

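Rule of thumb: when defining regular expressions, use r'' raw string literals, so that you don't have to double-escape everything that has meaning to both Python and the regular expression syntax.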
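As a small illustration of what the raw literal saves you, the sketch below writes the same pattern both ways. The variable names and the sample input are made up for this example.

```python
import re

# Without a raw string, each regex backslash must be doubled again for Python.
doubled = "\\\\."   # four backslashes in the source, two in the resulting string
raw = r"\\."        # the same pattern, written as the regex engine sees it

# Both literals produce exactly the same pattern string.
print(doubled == raw)

# The input here is: a \* b
print(re.sub(doubled, "-", "a \\* b"))
```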

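Next, you want to replace the 'escaped' characters with their originals. Use a group; re.sub() lets you reference groups in the replacement value:

r = re.compile(r"\\(.)") # Note the parentheses, that's a capturing group
r.sub(r'\1', s)          # \1 means: replace with value of first capturing group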

The output is now:

>>> r = re.compile(r"\\(.)") # Note the parentheses, that's a capturing group
>>> r.sub(r'\1', s)
'test * ! test * !! **'
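Wrapped up as a reusable function, this might look like the sketch below. The names unescape and unescape_punct are hypothetical, and the stricter variant assumes that only !, * and the backslash itself are ever escaped in your markup; adjust the character class to whatever your wiki dialect uses. Note that . does not match a newline unless you pass re.DOTALL.

```python
import re

# Pattern from the answer: a backslash followed by any single character.
_ESCAPED = re.compile(r"\\(.)")

def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    return _ESCAPED.sub(r"\1", text)

# Stricter variant (assumption): only these characters count as escapable.
_ESCAPED_PUNCT = re.compile(r"\\([!*\\])")

def unescape_punct(text):
    """Like unescape(), but leaves other backslash sequences untouched."""
    return _ESCAPED_PUNCT.sub(r"\1", text)

print(unescape("A sentence with an exclamation\\! Next is a \\* character"))
```

Because re.sub() scans left to right without overlapping matches, an escaped backslash is consumed as one unit and comes out as a single backslash, so it cannot accidentally unescape the character after it.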

Regex to replace "escaped" characters with their originals