Question

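I have an .html file saved to local disk, and I am parsing it with BeautifulSoup (bs4).

Everything worked fine until I recently switched to Python 3.

I tested the same .html file on another machine running Python 2, and there it works and returns the page content.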

soup = BeautifulSoup(open('page.html'), "lxml")

On the machine running Python 3 it does not work. It says:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x92 in position 298670: illegal multibyte sequence

After searching around, I tried the following, none of which helped (using 'r' or 'rb' made no real difference):

soup = BeautifulSoup(open('page.html', 'r'), "lxml")
soup = BeautifulSoup(open('page.html', 'r'), 'html.parser')
soup = BeautifulSoup(open('page.html', 'r'), 'html5lib')
soup = BeautifulSoup(open('page.html', 'r'), 'xml')

How can I parse this html page with Python 3?

Thanks.

Answer 1

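"Everything worked fine until I recently switched to Python 3."

In Python 3, strings are Unicode by default, so when you open a file in text mode Python tries to decode its contents. If you do not pass an encoding, it uses the platform default, which is 'gbk' on your machine according to the error message. Python 2, on the other hand, uses byte strings and simply returns the file contents as-is. Try opening page.html as a bytes object (open('page.html', 'rb')) and see if that works for you.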
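
A minimal sketch of the difference described above, using only the standard library. The sample bytes and the temporary file are invented for illustration, and encoding="gbk" is passed explicitly to mimic a machine whose default text encoding is GBK (as the error message in the question suggests).

```python
import locale
import os
import tempfile

# Hypothetical sample: byte 0x92 followed by a space is not a valid GBK sequence.
SAMPLE = b"<p>dealers\x92 prices</p>"


def read_as_text(path, encoding):
    """Open in text mode: Python 3 decodes the bytes while reading."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def read_as_bytes(path):
    """Open in binary mode: the bytes come back undecoded."""
    with open(path, "rb") as f:
        return f.read()


# The encoding that text-mode open() typically falls back to when none is given.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# Write the sample to a temporary file so both read modes can be compared.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(SAMPLE)

try:
    read_as_text(path, "gbk")
    text_mode_failed = False
except UnicodeDecodeError as exc:
    text_mode_failed = True
    print("text mode failed:", exc)

raw = read_as_bytes(path)  # succeeds, because nothing is decoded
os.remove(path)
```

The resulting bytes object can be passed to BeautifulSoup directly, which then attempts its own encoding detection instead of relying on the platform default.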

Answer 2

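I made two changes and am not sure which one (or both) took effect.

The computer had been formatted and reinstalled, so some settings were different.

1. In the language settings, go to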

Administrative language settings > Change system locale >

and tick the box

Beta: Use Unicode UTF-8 for worldwide language support

2. In the code. For example, this was the original line:

print (soup.find_all('span', attrs={'class': 'listing-row__price'})[0].text.strip().encode("utf-8"))

After removing the ".encode("utf-8")" part, it started working.
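
One plausible reading of the second change: in Python 3, .text is already a str, and calling .encode("utf-8") on it produces a bytes object, which print shows in its b'...' form rather than as text. A small sketch with a made-up price string (not taken from the actual page):

```python
# Hypothetical value standing in for the text of the price <span>.
price = "$12,995"

# Encoding a str yields bytes, not another str.
encoded = price.encode("utf-8")

print(encoded)  # b'$12,995'
print(price)    # $12,995
```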

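Update, October 16, 2019: The changes above worked, but with the following box ticked, fonts and text in foreign-language software were not displayed correctly: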
```
Beta: Use Unicode UTF-8 for worldwide language support
```

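With this box unticked, fonts and text in foreign-language software display fine. However, the original problem comes back.

Solution with the box unticked, where both the foreign-language software and the Python code work: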

soup = BeautifulSoup(open(pages, 'r', encoding = 'utf-8', errors='ignore'), "lxml")
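
Note that errors='ignore' silently drops any bytes that are not valid UTF-8. The sketch below shows what that means for a byte like 0x92, which is the right single quotation mark in Windows-1252. The sample bytes are invented, and whether the real page contains Windows-1252 text is an assumption.

```python
# Hypothetical sample: mostly ASCII text with one stray Windows-1252 byte.
sample = b"dealer\x92s price"

# errors='ignore' discards the undecodable byte entirely.
ignored = sample.decode("utf-8", errors="ignore")

# errors='replace' keeps a visible U+FFFD marker in its place.
replaced = sample.decode("utf-8", errors="replace")

# If the data really is Windows-1252 (an assumption), the byte decodes cleanly.
as_cp1252 = sample.decode("cp1252")

# ascii() is used so the output is printable on any console encoding.
print(ascii(ignored))    # 'dealers price'
print(ascii(replaced))   # 'dealer\ufffds price'
print(ascii(as_cp1252))  # 'dealer\u2019s price'
```

The same encoding and errors arguments are accepted by open(), so the line above loses the affected characters in the same way as the 'ignore' case here.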
