Scraping across multiple pages with ThreadPoolExecutor

Date: 2020-11-09 18:45:04

Tags: python web-scraping beautifulsoup threadpoolexecutor

I need to understand why my scraper doesn't work when I loop over pages with ThreadPoolExecutor:

with ThreadPoolExecutor(max_workers=10) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, num) for num in range(1, 2050)]
        allin = []
        for f in fs:
            f = f.result()
            if f:
                allin.extend(f)
                print("\n", allin)
       

I want to scrape some information (title, summary and date) from every page of a particular link. The code above submits the main function. It runs without any errors, but news items/pages are missing.

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd


def main(req, num):
    r = req.get(
        website+"/pag/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    stories = soup.select("div.story-content-pull")
    data = []
    for story in stories:
        row = []
        row.append(story.select_one('a').text)
        row.append(story.select_one('p').text.strip())
        row.append(story.select_one('time').text)
        data.append(row)
        return data

It would be very helpful if you could point out what's wrong in the code.

1 answer:

Answer 0 (score: 0)

The main problem is in main(): the return data statement is indented inside the for loop, so the function returns after appending only the first story on each page. That is why items go missing even though no error is raised. I have also cleaned up and simplified your code. One more detail: if there are 2050 pages, your code misses the last one, because the stop value of range() is exclusive, so range(1, 2050) only goes up to 2049.
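For reference, a minimal fix that keeps the rest of your function unchanged is just to de-indent the return statement so it only runs after the loop has processed every story on the page (a sketch only; website is whatever base URL your original script defines):

def main(req, num):
    r = req.get(website + "/pag/{}/".format(num))
    soup = BeautifulSoup(r.content, 'html.parser')
    data = []
    for story in soup.select("div.story-content-pull"):
        data.append([
            story.select_one('a').text,
            story.select_one('p').text.strip(),
            story.select_one('time').text,
        ])
    return data  # now outside the loop, so all stories on the page are returned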

For the full cleaned-up version, try this:

from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def main(num):
    # Fetch one listing page and return a [title, summary, date] row for each story on it.
    page = requests.get("https://www.cataniatoday.it/cronaca/pag/{}/".format(num))
    stories = BeautifulSoup(page.content, 'html.parser').select("div.story-content-pull")
    return [
        [
            story.select_one('a').text,
            story.select_one('p').text.strip(),
            story.select_one('time').text,
        ] for story in stories
    ]


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=10) as executor:
        # Iterating over the futures in submission order keeps the results in page order;
        # range(1, 2051) covers all 2050 pages, since range's stop value is exclusive.
        for result in [executor.submit(main, num) for num in range(1, 2051)]:
            print(result.result())
            # do more stuff here

This parses the articles in the same order in which they appear on the site. Test it with a smaller page range first (e.g. 1 - 6).
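Since your original script imports pandas but never uses it, here is a minimal sketch of collecting all rows into a DataFrame at the end (the column names Title/Summary/Date are my assumption, not something from your code):

import pandas as pd

allin = []
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(main, num) for num in range(1, 2051)]
    for future in futures:
        allin.extend(future.result())

# Column names are illustrative; rename them to match your data.
df = pd.DataFrame(allin, columns=["Title", "Summary", "Date"])
df.to_csv("stories.csv", index=False)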