Question

I have a binary file and I want to extract all of the ASCII characters while ignoring the non-ASCII ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close

However, I get an error when writing to the file: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How do I get Python to ignore non-ASCII characters?

Answer 1

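Basically, the ASCII table takes values in the range [0, 2⁷) and associates them with characters (printable or not). So, to ignore non-ASCII characters, you just have to ignore every character whose code is not in [0, 2⁷), that is, whose code is greater than 127.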

In Python there is a function named ord which, according to its docstring, does the following:

Return the integer ordinal of a one-character string.
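
As a small illustration of what ord returns, the sketch below checks a few sample characters against the ASCII range. The helper name is_ascii and the sample values are made up for this example and are not part of the original answer.

```python
# ord() maps a one-character string to its integer code point.
def is_ascii(character):
    """Return True if the character's code point is in [0, 128)."""
    return ord(character) < 128

# Hypothetical sample characters: two ASCII, two non-ASCII
samples = [u"A", u"~", u"\xa0", u"\u00e9"]
flags = [is_ascii(c) for c in samples]
print(flags)  # [True, True, False, False]
```

Here u"\xa0" (the non-breaking space from the error message) has code 160, so it falls outside the ASCII range.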

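In other words, it gives you the code of a character. Now you must ignore every character for which ord returns a value of 128 or more. This can be done as follows: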

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')

# The with statement closes out_file automatically, even on error
with open("text.txt", "w") as out_file:
    # Check every single character of `text`
    for character in text:
        # If it's an ASCII character
        if ord(character) < 128:
            out_file.write(character)
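
A different technique from the character-by-character loop, shown only as a possible alternative: the built-in codec error handler "ignore" can drop unencodable characters in one call. The function name strip_non_ascii and the sample bytes are assumptions made for this sketch.

```python
def strip_non_ascii(text):
    """Drop every character that the ASCII codec cannot encode."""
    # errors="ignore" silently skips characters outside the ASCII range
    return text.encode("ascii", "ignore").decode("ascii")

# Hypothetical UTF-16-LE input standing in for the binary file's contents
raw = u"caf\xe9\xa0ok".encode("utf-16-le")
text = raw.decode("utf-16-le")
cleaned = strip_non_ascii(text)
print(cleaned)  # cafok
```

The result contains only ASCII characters, so it can be written to a text file without triggering a UnicodeEncodeError.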

Now, if you only want to keep printable characters, note that all of them (at least in the ASCII table) lie between 32 (space) and 126 (tilde), so you simply have to do:

if 32 <= ord(character) <= 126:
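
To show that condition in context, here is a minimal sketch that applies it to a whole string. The function name keep_printable_ascii and the sample string are hypothetical.

```python
def keep_printable_ascii(text):
    """Keep only characters with codes from 32 (space) to 126 (tilde)."""
    return u"".join(c for c in text if 32 <= ord(c) <= 126)

# Hypothetical sample with a NUL byte, a newline and a non-breaking space
sample = u"Hi\x00 there\n\xa0!"
result = keep_printable_ascii(sample)
print(repr(result))
```

Note that this range also removes newlines (code 10) and tabs (code 9). If line breaks should survive, the condition would need to allow them explicitly.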

Python: convert a binary file to a string while ignoring non-ASCII characters