Question

When scraping multiple websites in a loop, I noticed a rather large difference in speed between,

sleep(10)
response = requests.get(url)

and,

response = requests.get(url, timeout=10)

That is, the timeout version is much faster.

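Moreover, for both setups I expected a scraping duration of at least 10 seconds per page before the next page is requested, but that is not the case.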

Why is there such a difference in speed?
Why is the scraping duration per page less than 10 seconds?

I am now using multiprocessing, but as far as I remember the above holds for non-multiprocessing as well.

Answer 1

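time.sleep stops your script from running for a given number of seconds, whereas timeout is the maximum time requests will wait for the server to respond. If the data is retrieved before the timeout is up, the remaining time is skipped. So with timeout, a request can take less than 10 seconds.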

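time.sleep is different: it pauses your script completely until the sleep is over, and only then does your request run, which takes however many seconds it needs. So the time.sleep version takes more than 10 seconds every time.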

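They have very different uses, but for your case you should build a timer: if a page finishes in under 10 seconds, make the program wait out the remainder.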
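
A minimal sketch of that timer idea, not a definitive implementation. The function name `fetch_with_min_duration` and its parameters are hypothetical; `fetch` is assumed to be any callable that takes a URL and returns a response, so the sketch itself does not depend on a particular HTTP library.

```python
import time


def fetch_with_min_duration(fetch, url, min_seconds=10.0):
    """Call fetch(url), then sleep so the whole step lasts at least min_seconds.

    `fetch` is a hypothetical callable supplied by the caller.
    """
    start = time.monotonic()           # monotonic clock: unaffected by system clock changes
    response = fetch(url)              # may return quickly or slowly
    elapsed = time.monotonic() - start
    remaining = min_seconds - elapsed
    if remaining > 0:                  # only wait if the fetch finished early
        time.sleep(remaining)
    return response
```

With requests, you could pass something like `lambda u: requests.get(u, timeout=10)` as `fetch`. A page that answers in 2 seconds then costs 10 seconds in total, while a page that takes 12 seconds is not delayed any further.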

Answer 2

response = requests.get(url, timeout=10)
# timeout specifies the maximum time the program will wait for the server to respond before an exception is raised. The program does not necessarily pause for 10 seconds: if the response arrives earlier, it stops waiting and continues.

See the requests documentation on timeouts for more details.

time.sleep puts your main thread to sleep, so your program will always wait 10 seconds before making the request to the URL.
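
To see that a timeout is an upper bound rather than a pause, here is a rough, self-contained sketch. It deliberately swaps in the standard library's `urllib.request.urlopen` instead of requests so it runs without third-party packages. Both apply the timeout at the socket level, so the behaviour illustrated should be comparable. The toy local server, `SERVER_DELAY`, `SlowHandler` and `timed_get` are all made up for this illustration.

```python
import socket
import socketserver
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler

SERVER_DELAY = 0.3  # seconds the toy server waits before answering (assumed value)


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(SERVER_DELAY)       # simulate a slow page
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass                           # silence per-request logging


def timed_get(url, timeout):
    """Return (body or None, elapsed seconds); body is None if the request failed or timed out."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read()
    except (socket.timeout, urllib.error.URLError):
        body = None
    return body, time.monotonic() - start


# Start the toy server on a free local port in a background thread.
server = socketserver.TCPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

generous = timed_get(url, timeout=5)    # returns once the server answers, well before 5 s
strict = timed_get(url, timeout=0.05)   # server is too slow for this limit, so no body

server.shutdown()
server.server_close()

print("timeout=5    ->", generous)
print("timeout=0.05 ->", strict)
```

The first call should return after roughly `SERVER_DELAY` seconds rather than 5, and the second should give up early with no body. In requests, the corresponding failure surfaces as a `requests.exceptions.Timeout` exception.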

Python web scraping: difference between sleep and request(page, timeout=x)
