UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

Asked: 2019-04-25 20:22:04

Tags: python character-encoding

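I am trying to open a series of HTML files so that I can use BeautifulSoup to extract the text from their bodies. I have about 435 files to process, but I keep getting this error.

I tried converting the HTML files to text and opening the text files instead, but I get the same error...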

import os
path = "./Bitcoin"
for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()

I want to get the source of each HTML file so that I can parse it with BeautifulSoup, but I get this error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-133-f32d00599677> in <module>
      3 for file in os.listdir(path):
      4     with open(os.path.join(path, file), "r") as fname:
----> 5         txt = fname.read()

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

1 Answer:

Answer 0 (score: 0)

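There are several ways to handle text data whose encoding is unknown. In this case, however, because you intend to pass the data to Beautiful Soup, the solution is simple: don't bother decoding the files yourself; let Beautiful Soup do it. Beautiful Soup will automatically decode bytes to unicode.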

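In your current code you read each file in text mode, which means Python decodes it with the platform's default encoding (UTF-8 on your system, as the traceback shows) unless you pass an encoding argument to open. If the file's contents are not valid UTF-8, decoding fails with this error.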

for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()
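
To make the failure concrete, here is a minimal sketch that reproduces it without any files. The byte string `sample` is a made-up stand-in for the contents of one of your HTML files; only the 0x92 byte is taken from your traceback.

```python
import locale

# Encoding that open() falls back to in text mode when no `encoding` is given.
print(locale.getpreferredencoding(False))

# Hypothetical sample: a curly apostrophe stored as the single cp1252 byte 0x92.
sample = b"Bitcoin\x92s price"

try:
    sample.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as exc:
    # 0x92 is a UTF-8 continuation byte, so it cannot start a character.
    decoded_as_utf8 = False
    print(exc)

# The same bytes decode cleanly when treated as cp1252.
as_cp1252 = sample.decode("cp1252")
print(ascii(as_cp1252))
```

The same distinction applies to open: passing encoding="cp1252" would avoid the error for this particular file, but only if every file really uses that encoding.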

Instead, read the HTML files in binary mode and pass the resulting bytes instance to Beautiful Soup.

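from bs4 import BeautifulSoup

for file in os.listdir(path):
    with open(os.path.join(path, file), "rb") as fname:
        bytes_ = fname.read()
    # Parse inside the loop so that every file is handled, not just the last one.
    # Naming the parser explicitly keeps results consistent across machines.
    soup = BeautifulSoup(bytes_, "html.parser")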
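
If you want to check which encoding Beautiful Soup settled on, or override it, something along these lines should work. This is only a sketch: `raw` is an invented sample document, and the value reported by original_encoding can vary depending on whether an encoding-detection library is installed alongside bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: the apostrophe is the single cp1252 byte 0x92.
raw = b"<html><body><p>Bitcoin\x92s price</p></body></html>"

# Let Beautiful Soup guess the encoding, then inspect its guess.
soup = BeautifulSoup(raw, "html.parser")
print(soup.original_encoding)

# If the guess is wrong, state the encoding explicitly with from_encoding.
soup = BeautifulSoup(raw, "html.parser", from_encoding="cp1252")
body_text = soup.body.get_text()
print(ascii(body_text))
```

Overriding the encoding is only worthwhile when you know it; for a mixed batch of files, letting Beautiful Soup detect it per file is usually the safer default.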

FWIW, the file that is currently causing the problem is probably encoded with cp1252 or a similar Windows 8-bit encoding.

>>> '’'.encode('cp1252')
b'\x92'
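
If you ever need the decoded text without Beautiful Soup, one of the other approaches mentioned at the top is to try likely encodings in order. The helper below is a hypothetical illustration (decode_with_fallback is not a library function), and the choice of utf-8 followed by cp1252 is an assumption based on the guess above.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first encoding that decodes data cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and mark bad bytes with U+FFFD.
    return data.decode(encodings[0], errors="replace"), encodings[0]


# Valid UTF-8 is accepted as-is; the 0x92 byte falls through to cp1252.
text, used = decode_with_fallback(b"It\x92s")
print(ascii(text), used)
```

Trying utf-8 first matters: almost any byte sequence decodes as cp1252, so putting it first would silently garble files that are really UTF-8.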