I've built a very simple web scraper that fetches roughly 100 small JSON files from the URL below. The problem is that the crawler takes more than an hour to finish. Given how small the JSON files are, I find that hard to understand. Am I doing something fundamentally wrong here?
import json

import requests
from lxml import html


def get_senate_vote(vote):
    # fetch the JSON for a single senate vote
    URL = 'https://www.govtrack.us/data/congress/113/votes/2013/s%d/data.json' % vote
    response = requests.get(URL)
    json_data = json.loads(response.text)
    return json_data


def get_all_votes():
    all_senate_votes = []
    # the directory index lists one <a> per vote
    URL = "http://www.govtrack.us/data/congress/113/votes/2013"
    response = requests.get(URL)
    root = html.fromstring(response.content)
    for a in root.xpath('/html/body/pre/a'):
        link = a.xpath('text()')[0].strip()
        if link[0] == 's':
            vote = int(link[1:-1])
            try:
                vote_json = get_senate_vote(vote)
            except:
                return all_senate_votes
            all_senate_votes.append(vote_json)
    return all_senate_votes


vote_data = get_all_votes()
Answer 0 (score: 1)
Here is a fairly simple version of the code in which I timed each call. On my system each request averages about 2 seconds, and there are 582 pages to fetch, so the whole run takes around 19 minutes without printing the JSON to the console. In your case, network latency plus the time spent printing is probably what pushes it higher.
#!/usr/bin/python
import re
import time

import requests


def find_votes():
    # scrape the directory listing and pull out every senate vote id (s1, s2, ...)
    r = requests.get("https://www.govtrack.us/data/congress/113/votes/2013/")
    data = r.text
    votes = re.findall(r's\d+', data)
    return votes


def crawl_data(votes):
    print("Total pages: " + str(len(votes)))
    for x in votes:
        url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + x + '/data.json'
        t1 = time.time()
        r = requests.get(url)
        json_data = r.json()
        print(time.time() - t1)  # seconds spent on this request


crawl_data(find_votes())
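One thing worth trying before changing the overall approach: reuse a single requests.Session so the HTTPS connection is kept alive between calls instead of being re-established for every vote. A minimal sketch along those lines, assuming the same govtrack URLs as above:

import re
import time

import requests


def crawl_with_session():
    # a single Session reuses the underlying HTTP(S) connection across requests
    with requests.Session() as session:
        listing = session.get("https://www.govtrack.us/data/congress/113/votes/2013/").text
        # dedupe in case a vote id appears more than once in the page (href and link text)
        votes = sorted(set(re.findall(r's\d+', listing)), key=lambda s: int(s[1:]))
        start = time.time()
        data = []
        for v in votes:
            url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + v + '/data.json'
            data.append(session.get(url).json())
        print("%d votes in %.1f seconds" % (len(data), time.time() - start))
        return data

This only removes the per-request connection setup cost; the requests are still sequential, so the bigger win comes from making them concurrent, as the next answer suggests.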
Answer 1 (score: 1)
If you are using Python 3.x and you are scraping many pages, then for better performance I warmly suggest the aiohttp module, which is built around asynchronous I/O.
For example:
import asyncio

import aiohttp

sites = ['url_1', 'url_2']
results = []


def save_response(result):
    # callback fired when a crawl task finishes; collect its body
    site_content = result.result()
    results.append(site_content)


async def crawl_site(site):
    async with aiohttp.ClientSession() as session:
        async with session.get(site) as resp:
            return await resp.text()


tasks = []
for site in sites:
    task = asyncio.ensure_future(crawl_site(site))
    task.add_done_callback(save_response)
    tasks.append(task)

all_tasks = asyncio.gather(*tasks)
loop = asyncio.get_event_loop()
loop.run_until_complete(all_tasks)
loop.close()

print(results)
More information on aiohttp.
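To connect this back to the question, the sites list can be built from the same directory listing the original code parses. This is only a sketch of how the pieces might fit together, assuming the govtrack URL pattern from the question:

import re

import requests

BASE = 'https://www.govtrack.us/data/congress/113/votes/2013'

# build the list of data.json URLs from the directory listing, deduplicating vote ids
listing = requests.get(BASE + '/').text
vote_ids = sorted(set(re.findall(r's\d+', listing)), key=lambda s: int(s[1:]))
sites = [BASE + '/' + v + '/data.json' for v in vote_ids]

With sites filled in this way, the gather/run_until_complete loop above fires all the requests concurrently, so the total runtime is closer to the slowest single response than to the sum of all of them.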