Question

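I'm trying to scrape a large amount of data and iterate over many pages. Once the loop covers more than about five pages, I keep getting timeout errors.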

### Libraries ###

import pyppdf.patch_pyppeteer
from bs4 import BeautifulSoup, SoupStrainer
import requests
from requests_html import HTMLSession
import pandas as pd
from random import randint
from time import sleep
import urllib.request

### SCRAPING PAGES 1-15

all_beer_info = []
for i in range(1,16):
    url = 'https://specsonline.com/product-category/beer/page/{}/'.format(i)
    print(url)
    session = HTMLSession()  
    resp = session.get(url, timeout=None)
    resp.html.render()  # render in case the site builds its content with JavaScript
    soup = BeautifulSoup(resp.html.html, features='lxml')
    beer_info = soup.select('.woocommerce-loop-product__title')
    for b in beer_info:
        results = (b.text)
        all_beer_info.append(results)
    sleep(randint(2,5))

Sometimes the whole script runs without any errors; other times it fails with: pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 8000 ms exceeded.

I have four separate loops like the one above, one each for names, prices, product sizes, and so on.

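Even if it gets through one or two of the loops, it times out before the code finishes, which wastes the whole run. Is there a more efficient way to run this code and combine the data? Any information would be appreciated.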

Answer 1

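The problem is that a "full render" with all the JavaScript content takes more than 8 seconds for some pages, so you get a TimeoutError. The default timeout for render() is 8 seconds. Try something like resp.html.render() with a larger timeout argument.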
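As a rough sketch of that suggestion, the loop could reuse a single session and retry the render with progressively larger timeouts. The helper names render_with_growing_timeout and scrape_titles, and the timeout values 20, 40 and 60 seconds, are assumptions made up for illustration; they do not come from the question or from requests_html. The only library feature relied on is the timeout keyword of render().

```python
from typing import Callable, Iterable, List


def render_with_growing_timeout(render: Callable[..., object],
                                timeouts: Iterable[float] = (20, 40, 60)) -> float:
    """Call render(timeout=t) for each timeout in turn until one succeeds.

    Returns the timeout that worked and re-raises the last error if all fail.
    The timeout values are illustrative guesses, not recommended settings.
    """
    last_error = None
    for t in timeouts:
        try:
            render(timeout=t)
            return t
        except Exception as exc:  # broad on purpose; pyppeteer raises its own TimeoutError
            last_error = exc
    if last_error is None:
        raise ValueError("timeouts must not be empty")
    raise last_error


def scrape_titles(first_page: int = 1, last_page: int = 15) -> List[str]:
    """Collect product titles from the listing pages used in the question."""
    # Imported here so the helper above can be used without requests_html installed.
    from requests_html import HTMLSession

    session = HTMLSession()  # one session reused for every page
    titles = []
    try:
        for i in range(first_page, last_page + 1):
            url = 'https://specsonline.com/product-category/beer/page/{}/'.format(i)
            resp = session.get(url)
            render_with_growing_timeout(resp.html.render)
            # After rendering, query the updated HTML directly.
            for element in resp.html.find('.woocommerce-loop-product__title'):
                titles.append(element.text)
    finally:
        session.close()
    return titles
```

Calling scrape_titles() would then return the list that the question builds in all_beer_info. Catching a broad Exception keeps the sketch free of a direct pyppeteer import; narrowing it to pyppeteer's TimeoutError would be stricter.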
