Question

I have some code containing strings with UTF-8 escape sequences written in decimal, for example

my_string = "Hello\035"

which should then be interpreted as

Hello#

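I don't mind parsing the decimal values myself. So far I have been applying something like this to the whole string, which seemed like the best approach (no errors, and it does something):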

print(codecs.escape_decode(my_string)[0].decode("utf-8"))

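But the numbering seems to be off, because I have to use the \043 escape sequence to get the hashtag (#) decoded correctly, and the same goes for every other character.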

Answer 1

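You cannot unambiguously detect and replace all \ooo escape sequences in a string literal, because those escape sequences are irreversibly replaced by their corresponding character values before the first line of your code runs. Python reads the digits in \ooo as octal, which is why \043 (octal 43, decimal 35) gives you #. As far as Python is concerned, "foo\041" and "foo!" are 100% identical, and there is no way to determine that the former object was defined with an escape sequence and the latter was not.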
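A quick way to see this for yourself is the minimal sketch below. The variable names are only illustrative; the escape behavior shown is that of ordinary Python string literals.

```python
# In a regular literal, the compiler resolves \ooo as an OCTAL escape
# before any of your code runs.
regular = "Hello\043"   # octal 43 == decimal 35 == '#'
raw = r"Hello\035"      # raw literal: the backslash and digits are kept as-is

# octal 41 == decimal 33 == '!', so both literals build the same string
identical = "foo\041" == "foo!"

print(regular)                   # Hello#
print(identical)                 # True
print(len(regular), len(raw))    # 6 9
print(ord("#"), oct(ord("#")))   # 35 0o43
```

Note that a regular "Hello\035" would end in the control character with code point 29 (octal 35), not in #.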

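If you have some flexibility in the format of your input data, you may still be able to do what you want. For example, if you are allowed to use raw strings instead of regular strings, then r"Hello\035" will not be interpreted as "Hello, followed by a hash sign" before runtime. It will be interpreted as "Hello, followed by a backslash, followed by 0, 3, and 5". Since the digit characters are still accessible, you can manipulate them in your code. For example,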

import re

def replace_decimal_escapes(s):
    return re.sub(
        # locate all backslashes followed by exactly three digits
        r"\\(\d\d\d)",
        # fetch the digit group, interpret it as a decimal integer, then get the corresponding char
        lambda x: chr(int(x.group(1), 10)),
        s
    )

test_strings = [
    r"Hello\035",
    r"foo\041",
    r"The \040quick\041 brown fox jumps over the \035lazy dog"
]

for s in test_strings:
    result = replace_decimal_escapes(s)
    print("input:  ", s)
    print("output: ", result)

Result:

input:   Hello\035
output:  Hello#
input:   foo\041
output:  foo)
input:   The \040quick\041 brown fox jumps over the \035lazy dog
output:  The (quick) brown fox jumps over the #lazy dog

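Additionally, this approach also works if you get your input string via input(), because backslashes typed by the user at that prompt are not interpreted as escape sequences. If you do print(replace_decimal_escapes(input())) and the user types "Hello\035", the output will be "Hello#", as desired.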
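The question mentions UTF-8, so the decimal values may be meant as UTF-8 bytes, not as code points. If that is the case, chr() will give the wrong result for anything outside ASCII. The sketch below is one possible variant under that assumption. The helper name decode_decimal_utf8_escapes is hypothetical. It assumes every escape is exactly three digits with a value from 000 to 255.

```python
import re

def decode_decimal_utf8_escapes(s):
    # Hypothetical helper: treat each backslash + three digits as ONE byte
    # written in decimal, then decode the whole byte string as UTF-8.
    data = s.encode("utf-8")
    data = re.sub(
        rb"\\(\d{3})",
        # bytes([n]) raises ValueError if n is outside 0..255
        lambda m: bytes([int(m.group(1))]),
        data,
    )
    return data.decode("utf-8")

print(decode_decimal_utf8_escapes(r"Hello\035"))    # Hello#
print(decode_decimal_utf8_escapes(r"caf\195\169"))  # café
```

For pure ASCII input this behaves the same as replace_decimal_escapes. The two differ only when consecutive escapes are meant to combine into one multi-byte character.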

How to correctly decode escape sequences written in decimal in a string
