How can I sanitize a string so that it contains only printable ASCII characters?

Asked: 2019-06-29 18:57:25

Tags: python string io ascii

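I want a function that sanitizes a string. The string returned by the sanitizer should contain only ASCII characters #32 (the space character) through #126 ('~').

ASCII character #9 (tab) should be replaced with four spaces. All other illegal characters should be replaced with the empty string. For example, "\n" should be replaced with the empty string. We do not want illegal characters replaced with a string that spells out the corresponding escape sequence. For example, we do not want a newline replaced with a backslash character followed by an 'n' character.

It is fine if the final string is Unicode-encoded rather than ASCII-encoded. I just want the only allowed characters to be the following: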

" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"

Example usage:

unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"
safe_string = sanitize(unsafe_string)
print(safe_string)

Output:

APPLES                     AND BANANAS   

Edit:

The attempted solutions below do not work, because they fail to filter out the newline characters.

import string
import re

unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"

safe_string = re.sub(r'[^\x00-\x7f]',r'', unsafe_string) 
print(safe_string)    

printable = set(string.printable)
safe_string = ''.join(filter(lambda x: x in printable, unsafe_string))
print(safe_string)
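
To see why both attempts leave the newline in place, it may help to check the two character sets directly. This is only a small sketch; the input string and the variable names are made up for illustration:

```python
import re
import string

# '\n' is code point 10, which lies inside \x00-\x7f, so the negated
# character class never matches it and the newline survives.
ascii_only = re.sub(r'[^\x00-\x7f]', '', "A\nB\u2502")
print(repr(ascii_only))

# string.printable also contains whitespace such as '\n', '\r' and '\t',
# so filtering against it keeps newlines as well.
newline_is_printable = '\n' in string.printable
print(newline_is_printable)
```

In both cases the box-drawing character is removed, but the newline is treated as an allowed character.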

2 Answers:

Answer 0 (score: 3)

import re

def sanitize(s):
    s = s.replace("\t", "    ")
    return re.sub(r"[^ -~]", "", s)

[ -~] means "everything in the range from (space) to ~". Adding ^ at the start of the class means everything except that range.

The output is:

APPLES                     AND BANANAS

In your example output, you forgot to replace the tabs with spaces.
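
A quick way to check this approach against the rules in the question is sketched below. The expected value is assembled from those rules (five tabs become 20 spaces, everything outside space through '~' is dropped) rather than copied from any printed output, and the name `expected` is just for illustration:

```python
import re

def sanitize(s):
    # Expand tabs first; otherwise the regex below would delete them.
    s = s.replace("\t", "    ")
    # Remove every character outside the range space (32) through '~' (126).
    return re.sub(r"[^ -~]", "", s)

unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"

# Five tabs -> 20 spaces, followed by the literal " AND " from the input.
expected = "APPLES" + " " * 20 + " AND " + "BANANAS"
assert sanitize(unsafe_string) == expected
```

The order of the two steps matters: if the regex ran first, the tabs would already be gone by the time `replace` looked for them.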

Answer 1 (score: 0)

You can iterate over the characters, get each character's code point, and check it against the allowed values:

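import re

def sanitize(unsafe_str):
    # Code points 32 (space) through 126 ('~') are allowed.
    allowed_range = set(range(32, 127))
    safe_str = ''
    for char in unsafe_str:
        cp = ord(char)
        if cp in allowed_range:
            safe_str += char
        elif cp == 9:
            # Tab becomes four spaces.
            safe_str += ' ' * 4
    # Collapse runs of whitespace into a single space.
    return re.sub(r'\s+', ' ', safe_str)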

Example:

In [1042]: unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"                                                                                                                             

In [1043]: def sanitize(unsafe_str): 
      ...:     allowed_range = set(range(32, 127)) 
      ...:     safe_str = '' 
      ...:     for char in unsafe_str: 
      ...:         cp = ord(char) 
      ...:         if cp in allowed_range: 
      ...:             safe_str += char 
      ...:         elif cp == 9: 
      ...:             safe_str += ' ' * 4 
      ...:     return re.sub(r'\s+', ' ', safe_str) 
      ...:      
      ...:                                                                                                                                                                                                  

In [1044]: sanitize(unsafe_string)                                                                                                                                                                          
Out[1044]: 'APPLES AND BANANAS'

The final re.sub(r'\s+', ' ', safe_str) call collapses runs of whitespace into a single space. If you don't want that, just return safe_str:

In [1046]: def sanitize(unsafe_str): 
      ...:     allowed_range = set(range(32, 127)) 
      ...:     safe_str = '' 
      ...:     for char in unsafe_str: 
      ...:         cp = ord(char) 
      ...:         if cp in allowed_range: 
      ...:             safe_str += char 
      ...:         elif cp == 9: 
      ...:             safe_str += ' ' * 4 
      ...:     return safe_str 
      ...:                                                                                                                                                                                                     

In [1047]: sanitize(unsafe_string)                                                                                                                                                                          
Out[1047]: 'APPLES                     AND BANANAS'

FWIW, this rebuilds the allowed set every time the function runs. Since it is a constant, you can put it at module level so that it is built only once, e.g.:

ALLOWED_RANGE = set(range(32, 127))
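
One possible way to put that together is sketched below. Two things here are choices made for illustration and not part of the code above: the constant (hypothetically named `ALLOWED_CHARS`) holds characters instead of code points, and the result is assembled with `str.join` instead of repeated `+=`.

```python
# Built once at import time: the characters ' ' (32) through '~' (126).
# ALLOWED_CHARS is an illustrative name, not taken from the answer.
ALLOWED_CHARS = frozenset(chr(cp) for cp in range(32, 127))

def sanitize(unsafe_str):
    parts = []
    for char in unsafe_str:
        if char in ALLOWED_CHARS:
            parts.append(char)
        elif char == "\t":
            parts.append(" " * 4)  # tab expands to four spaces
        # any other character is dropped
    return "".join(parts)

print(repr(sanitize("a\tb\nc\u2502")))  # prints 'a    bc'
```

Checking membership on the character itself avoids the `ord` call per character, and collecting pieces in a list avoids relying on the interpreter to optimize repeated string concatenation.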