How to scrape multiple pages faster and more efficiently in Python

Date: 2016-12-17 14:11:58

Tags: python web-scraping

I just wrote some code that scrapes the page of every GSOC organization, one by one.

At the moment this works fine, but it is slow. Is there a way to make it faster? Any other suggestions for improving this code are also welcome.

from bs4 import BeautifulSoup
import requests

f = open('GSOC-Organizations.txt', 'w')

# Collect the link and name of every organization from the archive index
r = requests.get("https://summerofcode.withgoogle.com/archive/2016/organizations/")
soup = BeautifulSoup(r.content, "html.parser")
a_tags = soup.find_all("a", {"class": "organization-card__link"})
title_heads = soup.find_all("h4", {"class": "organization-card__name"})
links, titles = [], []
for tag in a_tags:
    links.append("https://summerofcode.withgoogle.com" + tag.get('href'))
for title in title_heads:
    titles.append(title.getText())

# Fetch each organization page one at a time; this is the slow part
for i in range(len(links)):
    resp = requests.get(links[i])
    # ... parse resp.content and write to f ...

1 Answer:

Answer 0 (score: 1):

Instead of fetching each links[i] sequentially, use grequests to fetch them in parallel:

import grequests  # importing grequests before requests avoids gevent monkey-patching issues
from bs4 import BeautifulSoup
import requests

f = open('GSOC-Organizations.txt', 'w')

# Same setup as before: collect every organization link and name
r = requests.get("https://summerofcode.withgoogle.com/archive/2016/organizations/")
soup = BeautifulSoup(r.content, "html.parser")
a_tags = soup.find_all("a", {"class": "organization-card__link"})
title_heads = soup.find_all("h4", {"class": "organization-card__name"})
links, titles = [], []
for tag in a_tags:
    links.append("https://summerofcode.withgoogle.com" + tag.get('href'))
for title in title_heads:
    titles.append(title.getText())

# Build the requests lazily, then send them all concurrently
rs = (grequests.get(u) for u in links)

for i, resp in enumerate(grequests.map(rs)):
    if resp is None:  # grequests.map yields None for requests that failed
        continue
    print(resp, resp.url)
    # ... continue parsing ...
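
grequests.map also accepts a size argument (e.g. grequests.map(rs, size=20)) if you want to cap the number of concurrent connections instead of opening them all at once.

If you would rather not add a dependency, the standard library's concurrent.futures gives a similar speedup with a thread pool. A minimal sketch, assuming the links list built above; the max_workers value of 20 is an arbitrary choice:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    # requests releases the GIL while waiting on the network,
    # so worker threads overlap their I/O instead of serializing it
    return requests.get(url)

with ThreadPoolExecutor(max_workers=20) as pool:
    # pool.map yields responses in the same order as links
    for resp in pool.map(fetch, links):
        print(resp, resp.url)
        # ... continue parsing ...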