Question

I wrote a sample script:

import requests
from bs4 import BeautifulSoup
from threading import Thread
def test():
    r = requests.get('http://zhuanlan.sina.com.cn/')
    soup = BeautifulSoup(r.content,'lxml')

print('run test on main thread')
test()

print('run test on child thread')
t = Thread(target=test)
t.start()
t.join()

The output is:

run test on main thread
run test on child thread
encoding error : input conversion failed due to input error, bytes 0x95 0x50 0x22 0x20
encoding error : input conversion failed due to input error, bytes 0x95 0x50 0x22 0x20
encoding error : input conversion failed due to input error, bytes 0x95 0x50 0x22 0x20

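I wrote a test function and ran it on both the main thread and a child thread. As the output shows, when the function runs on the child thread it prints "encoding error : input conversion failed due to input error", and I can't suppress it. Why does this happen?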

Answer 1

I ran into this problem and want to build on @real_kappa_guy's solution, which is the right idea but could use more explanation.

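I believe the error comes from BeautifulSoup trying to determine the document's encoding. It uses a library called "Unicode, Dammit" to detect the encoding, but a document often doesn't contain enough information to determine it accurately. Those are the cases in which the encoding error gets printed.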

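The fix is indeed to pass the document's original encoding through the from_encoding parameter (iso-8859-1 is one example). You can get the encoding from the response programmatically: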

soup = BeautifulSoup(r.content, 'lxml', from_encoding=r.encoding)

There is more information in the BeautifulSoup documentation.
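One caveat: requests derives r.encoding from the HTTP headers, so it can be None or name a codec Python does not recognize. The sketch below is one way to guard against that. pick_encoding and its fallback parameter are hypothetical names, not part of requests or BeautifulSoup, and the check only confirms that Python knows the codec name, not that the name is correct for the page.

```python
import codecs


def pick_encoding(declared, fallback="utf-8"):
    """Return `declared` if Python has a codec by that name, else `fallback`.

    Hypothetical helper for illustration; `fallback="utf-8"` is an assumption.
    """
    if declared:
        try:
            codecs.lookup(declared)  # raises LookupError for unknown names
            return declared          # keep the server's spelling unchanged
        except LookupError:
            pass
    return fallback


print(pick_encoding("ISO-8859-1"))
print(pick_encoding(None))
print(pick_encoding("not-a-codec"))
```

It would be used as BeautifulSoup(r.content, 'lxml', from_encoding=pick_encoding(r.encoding)).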

Answer 2

I suspect this comes from the XML parser (lxml), because the error goes away when I use the HTML parser instead:

def test():
    r = requests.get('http://zhuanlan.sina.com.cn/')
    soup = BeautifulSoup(r.text, 'html.parser')

With that change, I get:

run test on main thread
run test on child thread
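Note that this snippet changes two things at once: the parser, and the input (r.text instead of r.content). r.text is a str that requests has already decoded, replacing undecodable bytes, so the parser never sees the raw bytes. The sketch below illustrates that difference with the four bytes from the error message. Treating the page as UTF-8 is an assumption made for the illustration, not something the question confirms.

```python
# The four bytes reported in the error message: 0x95 0x50 0x22 0x20
raw = b'\x95P" '

# Strict decoding fails: 0x95 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding swaps the bad byte for U+FFFD, which is how
# requests builds r.text (it decodes with errors="replace").
lenient = raw.decode("utf-8", errors="replace")

print(strict_ok)
print(repr(lenient))
```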

Answer 3

A bit late, but in case anyone runs into this again, this worked for me:

soup = BeautifulSoup(r.content,'lxml',from_encoding="iso-8859-1")
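A likely reason this silences the message: ISO-8859-1 assigns a character to every byte value, so a decoder told to use it never meets an invalid sequence. The trade-off is that a page actually written in another encoding parses without complaint but comes out garbled. The sketch shows both effects with Python's own codec, on the assumption that the parser's converter behaves the same way.

```python
# Every possible byte value decodes under ISO-8859-1, so this never raises.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")
print(len(decoded))

# The cost: UTF-8 text read as ISO-8859-1 decodes "successfully" into mojibake.
original = "中文"
utf8_bytes = original.encode("utf-8")      # 6 bytes for 2 characters
misread = utf8_bytes.decode("iso-8859-1")  # 6 unrelated Latin-1 characters
print(misread == original)
```

If the text of the page matters, and not just a quiet log, the encoding passed to from_encoding should be the page's real one.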

Creating a BeautifulSoup object in a child thread prints an encoding error
