Question

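I am trying to scrape this website. I managed to do it with urllib and BeautifulSoup, but urllib is too slow. Since there are thousands of URLs, I would like to send the requests asynchronously. I found that grequests is a good package for this.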

Example:


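The problem is that I don't know how to continue with BeautifulSoup from here, that is, how to get the HTML of each page. I would be glad to hear your thoughts. Thanks!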

Answer 1

You can iterate over the Response objects in the list a and pass each one's text attribute to BeautifulSoup:

from bs4 import BeautifulSoup

for response in a:  # a is the list of Response objects from the question
    soup = BeautifulSoup(response.text, 'html.parser')
    ...
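
One thing the loop above does not cover: grequests.map leaves a None entry in the list for every request that failed, and calling .text on None raises AttributeError. The following is a minimal sketch of a guarded version, not the answerer's code. The page_titles helper and the fake_responses stand-ins are hypothetical and exist only so the example runs without network access.

```python
from types import SimpleNamespace

from bs4 import BeautifulSoup


def page_titles(responses):
    """Return the <title> text of each successfully fetched page.

    Entries that are None (failed requests) are skipped; pages without
    a <title> element contribute an empty string.
    """
    titles = []
    for response in responses:
        if response is None:  # grequests.map leaves None for failed requests
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.title
        titles.append(title.get_text(strip=True) if title else "")
    return titles


# Hypothetical stand-ins for Response objects (only .text is needed here).
fake_responses = [
    SimpleNamespace(text="<html><head><title>Page one</title></head></html>"),
    None,  # simulates a request that failed
    SimpleNamespace(text="<html><body>No title here</body></html>"),
]
print(page_titles(fake_responses))
```

With real data, the list passed in would be whatever grequests.map returned for your URLs.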

Answer 2

See the script below, and also check the linked source. It should help.

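import grequests
from bs4 import BeautifulSoup

# links is assumed to be an iterable of URLs defined earlier
reqs = (grequests.get(link) for link in links)
# imap yields responses as they complete; size sets the number of
# concurrent requests (its second positional argument is stream, not a pool)
resp = grequests.imap(reqs, size=10)

for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    results = soup.find_all('a', attrs={'class': 'product__list-name'})
    print(results[0].text)
    prices = soup.find_all('span', attrs={'class': 'pdpPriceMrp'})
    print(prices[0].text)
    discount = soup.find_all('div', attrs={'class': 'listingDiscnt'})
    print(discount[0].text)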

Source: https://blog.datahut.co/asynchronous-web-scraping-using-python/
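
Note that indexing with [0] in the script above raises IndexError whenever a page lacks one of the expected elements. Below is a minimal sketch of a more defensive extraction that reuses the class names from the answer. The first_text and extract_product helpers and the sample HTML are hypothetical, and html.parser is used in place of lxml only so the example has no extra dependency.

```python
from bs4 import BeautifulSoup


def first_text(soup, tag, css_class):
    """Return the stripped text of the first matching element, or None."""
    element = soup.find(tag, class_=css_class)
    return element.get_text(strip=True) if element is not None else None


def extract_product(html):
    """Pull name, price and discount out of one page; missing fields are None."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": first_text(soup, "a", "product__list-name"),
        "price": first_text(soup, "span", "pdpPriceMrp"),
        "discount": first_text(soup, "div", "listingDiscnt"),
    }


# Hypothetical page fragment with no discount element.
sample_html = """
<a class="product__list-name" href="#">Example item</a>
<span class="pdpPriceMrp">100</span>
"""
print(extract_product(sample_html))
```

Inside the loop over responses, extract_product(r.text) would take the place of the three find_all calls, so a page with a missing field yields None instead of stopping the whole run.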

Asynchronous scraping with Python: grequests and BeautifulSoup4
