Question

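I am trying to remove non-printable characters from some string variables that I read from a text file. If I use the re.sub call below, it does not work: the \x.. characters are not removed.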

test1 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
test2 = re.sub('\\\\x(?:\d\d|\w\w|\d\w|\w\d)', '', test1)

However, if I take the value of test1 and pass it to re.sub as a "raw" string, it works perfectly:

test2 = re.sub('\\\\x(?:\d\d|\w\w|\d\w|\w\d)', '', r'ing record \xac\xd0\x81\xb4\x02\n2018 Apr')

test2 is then 'ing record \n2018 Apr'

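I was hoping to simply convert test1 from the first example into a raw string, but from my searching that looks either not easy or not possible. I am looking for a solution that lets me use re.sub to remove these characters from a str variable, or is there a way to convert my str variable to a raw string first?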

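Update: I ended up having to do several conversions to remove the unwanted hex codes while keeping my newline. This works, but I am not sure whether there is a cleaner way.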

test33 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
test44 = re.sub('\\\\x(?:\d\d|\w\w|\d\w|\w\d)', '', test33.encode('unicode-escape').decode("utf-8"))
test66 = test44.encode().decode('unicode-escape')
print(test66)

ing record 
2018 Apr

Answer 1

If your strings are plain ASCII, you can try:

import re
import string

test33 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'

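# Remove every character that is not in string.printable (newline is kept).
# re.escape keeps characters such as \ and ] literal inside the character class.
print(re.sub(r'[^{0}\n]'.format(re.escape(string.printable)), '', test33))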
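
For context, the first attempt in the question finds nothing because test1 already holds the actual characters (for example the single character U+00AC), not a backslash followed by "x", so a pattern that looks for a literal backslash has nothing to match. Below is a minimal sketch that matches the characters themselves, assuming Python 3 and assuming that only printable ASCII plus the newline should survive. The helper name strip_non_ascii_printable is made up for illustration.

```python
import re

# Matches any character that is NOT printable ASCII (space through tilde)
# and is not a newline. This range is an assumption about what to keep.
NON_PRINTABLE = re.compile(r'[^\x20-\x7e\n]')

def strip_non_ascii_printable(text):
    """Hypothetical helper: drop everything except printable ASCII and newlines."""
    return NON_PRINTABLE.sub('', text)

test1 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
print(repr(strip_non_ascii_printable(test1)))  # 'ing record \n2018 Apr'
```

This avoids the encode('unicode-escape') round trip from the question's update, since no conversion to a "raw" string is needed.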

Or use the Unicode solution given in "Stripping non printable characters from a string in python".
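
A rough sketch of what a Unicode-aware version could look like, assuming Python 3 and the standard unicodedata module. It drops characters whose Unicode general category starts with "C" (control, format, unassigned and similar) and keeps the newline. The helper name strip_control_chars and its keep parameter are made up for illustration and are not taken from the linked answer.

```python
import unicodedata

def strip_control_chars(text, keep='\n'):
    """Hypothetical helper: drop characters in Unicode categories C*,
    except those explicitly listed in keep."""
    return ''.join(
        ch for ch in text
        if ch in keep or not unicodedata.category(ch).startswith('C')
    )

test1 = 'ing record \xac\xd0\x81\xb4\x02\n2018 Apr'
print(repr(strip_control_chars(test1)))  # 'ing record ¬Ð´\n2018 Apr'
```

Note that this only removes \x81 and \x02 here. The characters \xac, \xd0 and \xb4 are printable in Unicode terms, so they stay; if those should go too, an ASCII-only filter is probably the better fit.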
