Question

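I'm working on a project in Python (3.6) in which I need to read text files from a directory that may contain thousands of them, run some analysis on each one, and upload the results to Google Cloud Storage. Reading some of the files fails with an encoding error.

Here's what I have tried:

From views.py: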

def predict_encoding(file_path, n_lines=60):
    '''Predict a file's encoding using chardet'''
    import chardet

    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])
    encoding = chardet.detect(rawdata)['encoding']
    print('Default encoding is: {}'.format(encoding))
    if encoding is None:
        rawdata.decode('utf8').encode('ascii', 'ignore')
        print('updated decoding is: {}'.format(chardet.detect(rawdata)['encoding']))
    return chardet.detect(rawdata)['encoding']


encoding = predict_encoding(text_path)
txt = Path(text_path).read_text(encoding=encoding)

However, for some files (see the sample file below), it fails with an error like this:

/Users/abdul/Downloads/to_save/cert2.txt
Default encoding is: None
updated decoding is: None
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 339: character maps to <undefined>

Here is a sample file that triggers this error: https://textuploader.com/d8ec5

Answer 1

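The file you are trying to analyze is an image (note Compress (tm) Xing Technology Corp in the file header). So before detecting the encoding, you need to check whether the file is binary. You can use the following solution for this: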

>>> is_binary_string(open(text_path, 'rb').read(1024))
True
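
Note that is_binary_string is not a built-in, and its definition is not shown above. The following is a minimal sketch of one common way to write it, assuming the heuristic used by the Unix file utility: strip every byte that normally appears in text and see whether anything is left. The names TEXT_CHARS and looks_binary are chosen here for illustration only.

```python
# Bytes that commonly appear in text: a few control characters
# (bell, backspace, tab, LF, FF, CR, ESC) plus everything from 0x20
# upward except DEL (0x7f).
TEXT_CHARS = bytearray({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7F})


def is_binary_string(data):
    """Return True if data contains bytes that rarely occur in text."""
    # translate(None, delete) drops every byte listed in `delete`;
    # whatever remains is a non-text byte.
    return bool(data.translate(None, TEXT_CHARS))


def looks_binary(file_path, sample_size=1024):
    """Apply the check to the first sample_size bytes of a file (hypothetical helper)."""
    with open(file_path, 'rb') as f:
        return is_binary_string(f.read(sample_size))
```

With something like this in place, you could call looks_binary(text_path) before predict_encoding and skip the file when it returns True. Keep in mind that this is only a heuristic: bytes of 0x80 and above are treated as text so that UTF-8 and Latin-1 files pass, so detection depends on control bytes such as NUL showing up in the sampled data.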
