Question

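I am trying to open a series of HTML files so that I can use BeautifulSoup to get the text from the body of each file. I have about 435 files to run through, but I keep getting this error.

I tried converting the HTML files to text and opening the text files instead, but I get the same error...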

import os
path = "./Bitcoin"
for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()

I want to get the source of each HTML file so that I can parse it with BeautifulSoup, but I get this error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-133-f32d00599677> in <module>
      3 for file in os.listdir(path):
      4     with open(os.path.join(path, file), "r") as fname:
----> 5         txt = fname.read()

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 2893: invalid start byte

Answer 1

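There are various ways to handle text data of unknown encoding. In this case, however, because you intend to pass the data to Beautiful Soup, the solution is simple: don't bother trying to decode the files yourself, let Beautiful Soup do it. Beautiful Soup will automatically decode bytes to unicode.

In your current code you are reading the files in text mode, which means Python decodes them with the platform's default encoding (UTF-8 here, as the traceback shows) unless you pass an encoding argument to open. If a file's contents are not valid UTF-8, the read fails with a UnicodeDecodeError.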

for file in os.listdir(path):
    with open(os.path.join(path, file), "r") as fname:
        txt = fname.read()

Instead, read the HTML files in binary mode and pass the resulting bytes instance to Beautiful Soup.

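from bs4 import BeautifulSoup

for file in os.listdir(path):
    with open(os.path.join(path, file), "rb") as fname:
        bytes_ = fname.read()
    # Parse inside the loop so every file is handled, not just the last one.
    soup = BeautifulSoup(bytes_, "html.parser")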
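
As a rough illustration of why binary mode avoids the error, here is a minimal standard-library sketch. The helper name read_all_bytes and the sample file are made up for the demo, and the Beautiful Soup call is left out so the snippet runs without third-party packages. Reading with "rb" never decodes anything, so a stray 0x92 byte cannot raise.

```python
import os
import tempfile


def read_all_bytes(path):
    """Return {filename: raw bytes} for every regular file directly in path.

    Hypothetical helper for illustration; no decoding happens here.
    """
    contents = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue  # skip subdirectories
        with open(full, "rb") as fh:
            contents[name] = fh.read()
    return contents


# Demo: a file containing byte 0x92, which is not valid UTF-8 on its own.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sample.html"), "wb") as fh:
        fh.write(b"<p>It\x92s fine</p>")
    pages = read_all_bytes(tmp)
```

Each value in pages is a bytes object that could then be handed to BeautifulSoup as in the loop above.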

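FWIW, the file currently causing your problem is probably encoded with cp1252 or a similar Windows 8-bit encoding. In cp1252, byte 0x92 is the right single quotation mark: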

>>> '’'.encode('cp1252')
b'\x92'
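
If you ever need the decoded text without Beautiful Soup, one possible approach is to try UTF-8 first and fall back to cp1252. This is only a sketch under the assumption that your files are one of those two encodings; decode_with_fallback is a hypothetical helper, not part of any library.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in order; return (text, encoding_used).

    Assumes the data is in one of the listed encodings. latin-1 is the
    last resort because it maps every byte value and so never fails,
    although the result may be wrong for some characters.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"


# Byte 0x92 is invalid as a UTF-8 start byte, so cp1252 is used instead.
text, used = decode_with_fallback(b"It\x92s a test")
```

Here text contains the curly apostrophe (U+2019) and used is "cp1252". Note that a successful decode does not prove the guess was right, which is another reason to let Beautiful Soup handle detection when you can.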
