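I want a function that sanitizes strings. The string returned by the sanitizer should contain only ASCII characters #32 (the space character) through #126 ('~').
ASCII character #9 (tab) should be replaced with four spaces. All other illegal characters should be replaced with the empty string. For example, "\n" should be replaced with the empty string. We do not want illegal characters replaced with a string representing the corresponding escape sequence. For example, we do not want a newline replaced with a backslash character followed by an 'n' character.
It is fine if the final string is Unicode-encoded rather than ASCII-encoded. I just want the only allowed characters to be the following: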
" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
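As a quick sanity check, the literal above appears to be exactly the characters with code points 32 through 126. The names allowed_literal and allowed_from_range below are only illustrative:

```python
# The allowed-character literal from the question, written as a Python string.
allowed_literal = " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"

# The same set of characters built from the code point range 32..126.
allowed_from_range = "".join(chr(cp) for cp in range(32, 127))

same = allowed_literal == allowed_from_range
print(same, len(allowed_literal))
```

So a check against the numeric range and a check against the literal should be interchangeable.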
Example usage:
unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"
safe_string = sanitize(unsafe_string)
print(safe_string)
Output:
APPLES AND BANANAS
The following attempted solutions do not work, because they fail to filter out the newline characters.
import string
import re
unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"
safe_string = re.sub(r'[^\x00-\x7f]',r'', unsafe_string)
print(safe_string)
printable = set(string.printable)
safe_string = ''.join(filter(lambda x: x in printable, unsafe_string))
print(safe_string)
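A likely reason both attempts keep the newlines: "\n", "\r" and "\t" are themselves ASCII (code points 10, 13 and 9), so they fall inside \x00-\x7f, and string.printable also includes the whitespace characters. A small check, with illustrative variable names:

```python
import string

control_chars = "\n\r\t"

# All three code points are below 0x80, so [^\x00-\x7f] never matches them.
in_ascii_range = all(ord(ch) <= 0x7F for ch in control_chars)

# string.printable contains string.whitespace, which includes \n, \r and \t.
in_printable = all(ch in string.printable for ch in control_chars)

print(in_ascii_range, in_printable)
```

Both values are True, which is why neither filter removes these characters.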
Answer 0 (score: 3)
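import re
def sanitize(s):
    # Expand each tab to four spaces before filtering.
    s = s.replace("\t", "    ")
    # Remove every character outside the range space (32) to ~ (126).
    return re.sub(r"[^ -~]", "", s)
[ -~] means "everything in the range from (space) to ~". Adding ^ at the start negates the class, so it matches everything outside that range.
The output is:
APPLES                     AND BANANAS
In your example output, you forgot to replace the tabs with spaces.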
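If the space-to-tilde range is hard to read, the same class can be spelled with hex escapes. This is only a sketch of the same idea, not a different technique; the short test string is made up for illustration:

```python
import re

def sanitize(s):
    # Expand each tab to four spaces first, so tabs are not simply deleted below.
    s = s.replace("\t", " " * 4)
    # [^\x20-\x7e] is the same class as [^ -~], with the endpoints written as hex escapes.
    return re.sub(r"[^\x20-\x7e]", "", s)

result = sanitize("\u2502a\tb\n~")
print(repr(result))
```

Here the box-drawing character and the newline are dropped, and the tab becomes four spaces.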
Answer 1 (score: 0)
You can iterate over the characters, get each code point, and check it against the allowed values:
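import re  # needed for the final whitespace collapse
def sanitize(unsafe_str):
    allowed_range = set(range(32, 127))
    safe_str = ''
    for char in unsafe_str:
        cp = ord(char)
        if cp in allowed_range:
            # Printable ASCII: keep as-is.
            safe_str += char
        elif cp == 9:
            # Tab: expand to four spaces.
            safe_str += ' ' * 4
    # Collapse each run of whitespace into a single space.
    return re.sub(r'\s+', ' ', safe_str)
Example: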
In [1042]: unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"
In [1043]: def sanitize(unsafe_str):
      ...:     allowed_range = set(range(32, 127))
      ...:     safe_str = ''
      ...:     for char in unsafe_str:
      ...:         cp = ord(char)
      ...:         if cp in allowed_range:
      ...:             safe_str += char
      ...:         elif cp == 9:
      ...:             safe_str += ' ' * 4
      ...:     return re.sub(r'\s+', ' ', safe_str)
      ...:
In [1044]: sanitize(unsafe_string)
Out[1044]: 'APPLES AND BANANAS'
The final re.sub(r'\s+', ' ', safe_str) call collapses each run of whitespace into a single space. If you don't want that, simply return safe_str instead:
In [1046]: def sanitize(unsafe_str):
      ...:     allowed_range = set(range(32, 127))
      ...:     safe_str = ''
      ...:     for char in unsafe_str:
      ...:         cp = ord(char)
      ...:         if cp in allowed_range:
      ...:             safe_str += char
      ...:         elif cp == 9:
      ...:             safe_str += ' ' * 4
      ...:     return safe_str
      ...:
In [1047]: sanitize(unsafe_string)
Out[1047]: 'APPLES                     AND BANANAS'
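FWIW, this rebuilds the allowed set every time the function runs. Since it is a constant, you can move it to module level so that it is built only once, e.g.: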
ALLOWED_RANGE = set(range(32, 127))
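Put together, the non-collapsing variant with the module-level constant might look like the sketch below. Using frozenset and collecting the pieces in a list before joining are my own choices here, not something the approach requires:

```python
# Built once at import time; frozenset signals that it is never modified.
ALLOWED_RANGE = frozenset(range(32, 127))

def sanitize(unsafe_str):
    parts = []
    for char in unsafe_str:
        cp = ord(char)
        if cp in ALLOWED_RANGE:
            # Printable ASCII: keep as-is.
            parts.append(char)
        elif cp == 9:
            # Tab: expand to four spaces.
            parts.append(" " * 4)
        # Everything else is dropped.
    return "".join(parts)

unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"
print(repr(sanitize(unsafe_string)))
```

For the example string this keeps the 20 spaces from the five tabs plus the original space before AND.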