Question

I am reading a Word file with the following code:

import win32com.client as win32

word = win32.dynamic.Dispatch("Word.Application")
word.Visible = 0
doc = word.Documents.Open(SigLexiconFilePath)

From the file I get strings that contain many non-printable characters:

str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"

I tried the following code to remove the non-printable characters:

import string 

str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
filtered_string = "".join(filter(lambda x:x in string.printable, str))

This gives me the following output:

keinefreigb\x0b\r

Other code I tried:

str = str.split('\r')[0]
str = str.strip()

This gives me the following output:

keine\xa0freigäbü

How can I remove all of these non-printable characters with as little code as possible, to get the desired output below?

keine freigäbü

Answer 1

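Most of these characters appear to be whitespace characters (\x07 is the BEL control character). You can try Python's unicodedata module to normalize some of them consistently into ordinary spaces: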

>>> import unicodedata
>>> unicodedata.normalize("NFKC", "\xa0keine\xa0freigäbü\xa0\x0b\r\x07")
' keine freigäbü \x0b\r\x07'

Then, if the set of characters to remove is small, you can chain a few replace calls followed by strip to get what you want.

>>> ' keine freigäbü \x0b\r\x07'.replace("\x0b"," ").replace("\r"," ").\
        replace("\x07"," ").strip()
'keine freigäbü'
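If the set of unwanted characters is not known in advance, the normalize-then-replace idea above could be generalized along these lines. This is only a sketch: the function name clean_text is hypothetical, and treating every character in a Unicode "C" (control/format/other) category as removable is an assumption that may be too aggressive for some inputs.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical helper: normalize, drop control characters, tidy spaces."""
    # NFKC maps compatibility characters such as NO-BREAK SPACE (\xa0) to a
    # plain space while keeping letters like "ä" in composed form.
    normalized = unicodedata.normalize("NFKC", text)
    # Replace every character whose Unicode category starts with "C"
    # (e.g. Cc for \x0b, \r, \x07) with a space.
    spaced = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch
        for ch in normalized
    )
    # split() with no arguments collapses whitespace runs and trims the ends.
    return " ".join(spaced.split())


print(repr(clean_text("\xa0keine\xa0freig\xe4b\xfc\xa0\x0b\r\x07")))
```

Replacing with a space instead of deleting keeps words apart when a control character is the only separator between them.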

Hope this helps.

Answer 2

Try this:

import re

def convert_tiny_str(x: str):
    r"""Taking this into consideration:

    > https://www.ascii-code.com/

    Citing: "The first 32 characters in the ASCII-table are unprintable control
    codes and are used to control peripherals such as printers."
    That is hex code 00 to hex code 1F, [00, 1F].

    In extended ASCII, the printable characters are listed
    from \x20 to \xFF in hexadecimal, [20, FF].

    So the regular expression I can offer as a possible
    solution is this:

    1- Replace every character except the printable ones with ''.

    2- The character \xa0 is then still part of the resulting string.
    Replace it with ' '.
    """

    # Use the parameter x here, not the global _str.
    _out = re.sub(r'[^\x20-\xff]', r'', x)
    # >> '\xa0keine\xa0freigäbü\xa0'

    return re.sub(r'\xa0', r' ', _out)


_str = "\xa0keine\xa0freigäbü\xa0\x0b\r\x07"
x = convert_tiny_str(_str)

print(x)
# >>' keine freigäbü '
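The result above still has a leading and a trailing space. A possible single-pass variant of the same regex idea is sketched below; the name strip_controls and the exact character ranges are assumptions for illustration, covering the C0 controls, DEL, the C1 controls and the no-break space \xa0.

```python
import re

# Runs of C0 controls (\x00-\x1f), DEL and C1 controls (\x7f-\x9f),
# and the no-break space (\xa0).
_UNWANTED = re.compile(r"[\x00-\x1f\x7f-\xa0]+")


def strip_controls(text: str) -> str:
    """Hypothetical helper: turn each unwanted run into one space, then trim."""
    return _UNWANTED.sub(" ", text).strip()


print(repr(strip_controls("\xa0keine\xa0freig\xe4b\xfc\xa0\x0b\r\x07")))
```

Substituting a space for each run and trimming afterwards avoids the second re.sub call and gives the output asked for in the question.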

Done.

How do I remove non-printable characters from a string?
