I just wrote some code that scrapes each GSoC organization's page one at a time.
Right now this approach works, but it is slow. Is there a way to make it faster? Any other suggestions for improving this code are also welcome.
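For context, the sequential version presumably looks roughly like the sketch below. This is a hypothetical reconstruction, not the asker's exact code; it reuses the organization-card__link selector that appears in the answer:

# Hypothetical sketch of the sequential approach described above:
# fetch the archive index, collect the organization links, then
# request each organization page one at a time (one blocking call each).
from bs4 import BeautifulSoup
import requests

base = "https://summerofcode.withgoogle.com"
r = requests.get(base + "/archive/2016/organizations/")
soup = BeautifulSoup(r.content, "html.parser")

links = [base + a.get('href')
         for a in soup.find_all("a", {"class": "organization-card__link"})]

for link in links:
    page = requests.get(link)              # each request waits for the previous one
    org_soup = BeautifulSoup(page.content, "html.parser")
    # ... parse and write organization details ...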
Answer 0 (score: 1)
Instead of fetching all links[i] sequentially, use grequests to fetch them in parallel:
import grequests  # importing grequests before requests avoids gevent monkey-patching issues
from bs4 import BeautifulSoup
import requests

f = open('GSOC-Organizations.txt', 'w')

# Fetch the archive index page and collect the per-organization links and titles.
r = requests.get("https://summerofcode.withgoogle.com/archive/2016/organizations/")
soup = BeautifulSoup(r.content, "html.parser")
a_tags = soup.find_all("a", {"class": "organization-card__link"})
title_heads = soup.find_all("h4", {"class": "organization-card__name"})

links = ["https://summerofcode.withgoogle.com" + tag.get('href') for tag in a_tags]
titles = [title.getText() for title in title_heads]

# Build the requests lazily, then let grequests send them concurrently.
rs = (grequests.get(u) for u in links)
for i, resp in enumerate(grequests.map(rs)):
    if resp is None:  # grequests.map returns None for requests that failed
        continue
    print(resp, resp.url)
    # ... continue parsing ...
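If installing grequests (and its gevent dependency) is not an option, a thread pool from the standard library gives a similar speed-up. A minimal sketch, assuming the links list built above (the worker count of 20 is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    # runs in a worker thread; any exception is re-raised when the result is consumed
    return requests.get(url)

with ThreadPoolExecutor(max_workers=20) as pool:
    for resp in pool.map(fetch, links):
        print(resp, resp.url)
        # ... continue parsing ...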