Question

我的程序如下所示：

import re
# Escape the string, in case it happens to have re metacharacters
my_str = "The quick brown fox jumped"
escaped_str = re.escape(my_str)
# "The\\ quick\\ brown\\ fox\\ jumped"
# Replace escaped space patterns with a generic white space pattern
spaced_pattern = re.sub(r"\\\s+", r"\s+", escaped_str)
# Raises error

错误是这样的：

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/swfarnsworth/programs/pycharm-2019.2/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/home/swfarnsworth/programs/pycharm-2019.2/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/swfarnsworth/projects/medaCy/medacy/tools/converters/con_to_brat.py", line 255, in <module>
    content = convert_con_to_brat(full_file_path)
  File "/home/swfarnsworth/projects/my_file.py", line 191, in convert_con_to_brat
    start_ind = get_absolute_index(text_lines, d["start_ind"], d["data_item"])
  File "/home/swfarnsworth/projects/my_file.py", line 122, in get_absolute_index
    entity_pattern_spaced = re.sub(r"\\\s+", r"\s+", entity_pattern_escaped)
  File "/usr/local/lib/python3.7/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/local/lib/python3.7/re.py", line 309, in _subx
    template = _compile_repl(template, pattern)
  File "/usr/local/lib/python3.7/re.py", line 300, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "/usr/local/lib/python3.7/sre_parse.py", line 1024, in parse_template
    raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \s at position 0

即使我删除了'\s+'之前的两个反斜杠，或者将原始字符串（r"\\\s+"）变成了常规字符串，我仍然会收到此错误。我检查了Python 3.7文档，看来\s仍然是空白的转义序列。

Answer 1

尝试摆弄反斜杠，以避免正则表达式尝试解释\s：

spaced_pattern = re.sub(r"\\\s+", "\\\s+", escaped_str)

现在

>>> spaced_pattern
'The\\s+quick\\s+brown\\s+fox\\s+jumped'
>>> print(spaced_pattern)
The\s+quick\s+brown\s+fox\s+jumped

但是为什么？

似乎python试图像解释\s那样解释r"\n"，而不是像通常的Python那样单独解释。如果是这样。例如：

re.sub(r"\\\s+", r"\n+", escaped_str)

产量：

The
+quick
+brown
+fox
+jumped

即使在原始字符串中使用了\n。

此更改是在Issue #27030: Unknown escapes consisting of '\' and ASCII letter in

中引入的

进行替换的代码位于sre_parse.py（python 3.7）中：

        else:
            try:
                this = chr(ESCAPES[this][1])
            except KeyError:
                if c in ASCIILETTERS:
                    raise s.error('bad escape %s' % this, len(this))

此代码查找文字\后面的内容，并尝试将其替换为适当的非ascii字符。显然s不在ESCAPES字典中，因此会触发KeyError异常，然后触发您所收到的消息。

在以前的版本中，它只是发出警告：

import warnings
warnings.warn('bad escape %s' % this,
              DeprecationWarning, stacklevel=4)

看起来并非只有我们一个人经历了3.6到3.7的升级：https://github.com/gi0baro/weppy/issues/227

Answer 2

在替换字符串方面，正则表达式引擎的行为方式（大部分）相同（
）交给他们的。
他们尝试插入与转义字符等效的控制代码，例如制表符crlf等...
它无法识别的任何转义序列，只会剥离转义。

给定
spaced_pattern = re.sub(r"\\\s+", r"\s+", escaped_str)

r"\s+"将替换字符串\s+交给引擎。
由于没有这样的转义序列，它只是剥离了转义
并将s+插入替换位置。

您可以在这里https://regex101.com/r/42QCvi/1看到它
没有引发任何错误，但是应该是由于您没有得到您认为的应有的错误。

实际上，应该始终对字面转义进行转义
如此处https://regex101.com/r/bzQgfN/1

所示

没什么新鲜的，他们只是说这是一个错误，但实际上是一个通知警告
你没有得到自己的想法。
多年来一直这样。有时是错误，有时不是。

Answer 3

只需尝试使用import regex as re而不是import re。

Answer 4

如果您尝试用单个反斜杠替换任何内容，Python 3.8.5 的 re 和 regex 包都无法执行独自一人。

我依赖的解决方案是在 re.sub 和 Python 的 replace 之间拆分任务：

import re
re.sub(r'([0-9.]+)\*([0-9.]+)',r'\1 XBACKSLASHXcdot \2'," 4*2").replace('XBACKSLASHX','\\')

Answer 5

我想你可能正在尝试做

import re
# Escape the string, in case it happens to have re metacharacters
my_str = "The\\ quick\\ brown\\ fox\\ jumped"
escaped_str = re.escape(my_str)
# "The\\ quick\\ brown\\ fox\\ jumped"
# Replace escaped space patterns with a generic white space pattern
print(re.sub(r"\\\\\\\s+", " ", escaped_str))

输出1

The quick brown fox jumped

如果您可能想使用文字\ s +，请尝试this answer或

import re
# Escape the string, in case it happens to have re metacharacters
my_str = "The\\ quick\\ brown\\ fox\\ jumped"
escaped_str = re.escape(my_str)
print(re.sub(r"\\\\\\\s+", re.escape(r"\s") + '+', escaped_str))

输出2

The\s+quick\s+brown\s+fox\s+jumped

或者也许：

import re
# Escape the string, in case it happens to have re metacharacters
my_str = "The\\ quick\\ brown\\ fox\\ jumped"
print(re.sub(r"\s+", "s+", my_str))

输出3

The\s+quick\s+brown\s+fox\s+jumped

如果您希望简化/修改/探索表达式，请在regex101.com的右上角进行说明。如果愿意，您还可以在this link中查看它如何与某些示例输入匹配。

RegEx电路

jex.im可视化正则表达式：

Python 3.7.4：'re.error：位置0处的转义错误'

5 个答案:

但是为什么？

输出1

输出2

输出3

RegEx电路

Demo