Question

Python regex experts! I'm trying to change one line in an XML document. The original line is:

<Tag name="low"     Value="%hello%\dir"/>

The result I want to see is:

<Tag name="low"     Value="C:\art"/>

My failed direct attempt was:

lines = re.sub("%hello%\dir", "C:\art", a)

This doesn't work. It doesn't change anything in the document. Is there something special about %?

For testing purposes, I tried:

lines = re.sub("dir", "C:\art", a)

and got:

<Tag name="low"     Value="%hello%\C:BELrt"/>

The problem is that \a = BEL.

I've tried a few other things, but to no avail. How do I work around this?

Answer 1

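In Python, use the r prefix on string literals so that you don't have to escape backslashes for Python itself. Then escape the backslash for the regex engine, so that \d is not treated as "match a digit".

lines = re.sub(r"%hello%\\dir", r"C:\\art", a)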
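A minimal runnable check of the line above, assuming the XML line is held in a variable `a` as in the question's test:

```python
import re

# Raw literal: the backslash before "d" stays a literal backslash.
a = r'<Tag name="low"     Value="%hello%\dir"/>'

# In the pattern, \\ matches one literal backslash.
# In the replacement, \\ is emitted as one backslash
# (a bare \a there would be turned into the BEL character).
lines = re.sub(r"%hello%\\dir", r"C:\\art", a)
print(lines)
```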

Answer 2

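Your problem is that some of your characters have a special meaning in regular expressions.

\d means any digit, so %hello%\dir is interpreted as %hello%[0-9]ir

You need to escape these backslashes and use raw strings to fix this: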
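A quick way to see this for yourself (the sample input strings here are made up for illustration):

```python
import re

pattern = r"%hello%\dir"  # unescaped: \d is the digit class

# Matches "%hello%" + one digit + "ir" ...
matches_digit = re.search(pattern, "%hello%7ir") is not None
# ... but not a literal backslash followed by "dir".
matches_literal = re.search(pattern, r"%hello%\dir") is not None

print(matches_digit, matches_literal)  # True False
```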

import re

a = r'''<Tag name="low" Value="%hello%\dir"/>'''
lines = re.sub(r"%hello%\\dir", r"C:\\art", a)
print(lines)  # <Tag name="low" Value="C:\art"/>

Answer 3

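This is a good question. It shows three text-representation issues at once:

The Python string literal '\a' is a single BELL character.

To write a backslash followed by the letter 'a' in Python source code, use either a raw literal, r'\a', or an escaped backslash, '\\a'.
r'\d' (two characters) has a special meaning when interpreted as a regular expression: it matches a digit.

On top of the rules for Python string literals, you also need to escape any regex metacharacters. You can use re.escape(your_string) in the general case, or just write r'\\d' or '\\\\d'. The '\a' in the repl argument must be escaped as well (twice in your case: r'\\a' or '\\\\a'):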
```
>>> import re
>>> xml = r'<Tag name="low"     Value="%hello%\dir"/>'
>>> old, new = r'%hello%\dir', r'C:\art'
>>> print(re.sub(re.escape(old), new.replace('\\', r'\\'), xml))
<Tag name="low"     Value="C:\art"/>
```
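Another way to avoid escaping the replacement, sketched here as an alternative rather than as part of the original answer, is to pass a function as repl. `re.sub` inserts a callable's return value as-is, without processing backslash escapes. The variable name `xml_line` is an assumption for this example:

```python
import re

xml_line = r'<Tag name="low"     Value="%hello%\dir"/>'
old, new = r'%hello%\dir', r'C:\art'

# re.escape() neutralizes regex metacharacters in the pattern;
# the lambda returns the replacement text verbatim.
result = re.sub(re.escape(old), lambda match: new, xml_line)
print(result)
```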
By the way, you don't need a regex at all in this case:
```
>>> print(xml.replace(old, new))
<Tag name="low"     Value="C:\art"/>
```
Finally, an XML attribute value can't contain certain characters; these also need to be escaped, e.g. '&', '"', '<'.

In general, you should not use regexes to manipulate XML. Python's stdlib has XML parsers.
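If you do build the attribute text by hand, the stdlib's `xml.sax.saxutils` module can do the escaping. A small sketch, with made-up sample values:

```python
from xml.sax.saxutils import escape, quoteattr

value = r'C:\art & "design" <v2>'

# escape() handles &, < and > by default; the extra mapping
# also converts double quotes for use inside "..." attributes.
escaped_text = escape(value, {'"': "&quot;"})

# quoteattr() escapes the value and wraps it in quotes.
quoted = quoteattr(r"C:\art & design")

print(escaped_text)
print(quoted)
```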

>>> import xml.etree.ElementTree as etree
>>> xml = r'<Tag name="low"     Value="%hello%\dir"/>'
>>> tag = etree.fromstring(xml)
>>> tag.set('Value', r"C:\art & design")
>>> etree.dump(tag)
<Tag name="low" Value="C:\art &amp; design" />
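For a whole document rather than a single element, the same idea might look like the sketch below. The `<Config>` wrapper and the second `Tag` are hypothetical, added only to show that other elements are left alone:

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only the first Tag comes from the question.
document = r'''<Config>
  <Tag name="low"     Value="%hello%\dir"/>
  <Tag name="high"    Value="%hello%\other"/>
</Config>'''

root = ET.fromstring(document)

# Plain string comparison and assignment: no regex or escaping involved.
for tag in root.iter("Tag"):
    if tag.get("Value") == r"%hello%\dir":
        tag.set("Value", r"C:\art")

updated = ET.tostring(root, encoding="unicode")
print(updated)
```

The parser and serializer take care of XML escaping, so the values are handled as ordinary Python strings.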
