How do I return the scraped data when using BeautifulSoup and concurrent.futures?

Date: 2020-07-16 18:50:49

Tags: python web-scraping beautifulsoup concurrent.futures

I'm trying to asynchronously scrape some recipes from NYT Cooking and followed this blog post: https://beckernick.github.io/faster-web-scraping-python/

It prints the results without any problem, but for some reason my return does nothing here. I need the list to be returned. Any ideas?

import requests
from bs4 import BeautifulSoup
import concurrent.futures
import time

session = requests.Session()

MAX_THREADS = 30
urls = ['https://cooking.nytimes.com/search?q=&page={page_number}'.format(page_number=p) for p in range(1,5)]

# grab all of the recipe cards on each search page
def extract_recipe_urls(url):
    """returns a list of recipe urls"""
    recipe_cards = []
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    for rs in soup.find_all("article",{"class":"card recipe-card"}):
        recipe_cards.append(rs.find('a')['href'])
    
    print(recipe_cards)
    
    return recipe_cards

def async_scraping(scrape_function, urls):
    threads = min(MAX_THREADS, len(urls))
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        executor.map(scrape_function, urls)

1 Answer:

Answer 0 (score: 2):

You have to get the result of map:

 results = executor.map(...)

and later you can use it in a loop

for item in results:
    print(item)

or convert it to a list

all_items = list(results)

BTW: because results is a generator, you can't use it in two for-loops (or in a for-loop and then list()); you have to first get all the items as a list with all_items = list(results) and then use this list all_items in both for-loops.
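
For example, a quick self-contained sketch (with a made-up square() function instead of the scraper) shows this one-pass behaviour:

import concurrent.futures

def square(x):
    return x * x

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    results = executor.map(square, [1, 2, 3])

print(list(results))   # [1, 4, 9] - the first pass consumes the iterator
print(list(results))   # []        - a second pass finds it already exhausted

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    all_items = list(executor.map(square, [1, 2, 3]))

print(all_items)   # [1, 4, 9]
print(all_items)   # [1, 4, 9] - a real list can be iterated as often as needed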


Minimal working code:

import requests
from bs4 import BeautifulSoup
import concurrent.futures
import time

# --- constants ---

MAX_THREADS = 30

# --- functions ---

# grab all of the recipe cards on each search page
def extract_recipe_urls(url):
    """returns a list of recipe urls"""
    session = requests.Session()
    recipe_cards = []
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    for rs in soup.find_all("article", {"class": "card recipe-card"}):
        recipe_cards.append(rs.find('a')['href'])

    return recipe_cards

def async_scraping(scrape_function, urls):
    threads = min(MAX_THREADS, len(urls))

    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        results = executor.map(scrape_function, urls)

    return results

# --- main ---

urls = ['https://cooking.nytimes.com/search?q=&page={page_number}'.format(page_number=p) for p in range(1,5)]

results = async_scraping(extract_recipe_urls, urls)

#all_items = list(results)

for item in results:
    print(item)

BTW: every extract_recipe_urls call gives you a list, so finally results is a list of lists.

all_items = list(results)
print('len(all_items):', len(all_items))

for item in all_items:
    print('len(item):', len(item))

Result:

len(all_items): 4
len(item): 48
len(item): 48
len(item): 48
len(item): 48

If you want all the items in one flat list, you can merge the sub-lists with list1.extend(list2) or list1 + list2, or flatten them all at once with sum(..., []):

all_items = sum(all_items, [])
print('len(all_items):', len(all_items))

Result:

len(all_items): 192
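
Side note: sum(all_items, []) copies the accumulated list on every addition, so for many pages a common alternative is itertools.chain.from_iterable. A small sketch, using stand-in data in place of the scraped urls:

from itertools import chain

# stand-in for the scraped data: 4 pages with 48 urls each
all_items = [['/recipes/{}'.format(i)] * 48 for i in range(4)]

flat = list(chain.from_iterable(all_items))
print(len(flat))   # 192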