I've built a very simple web scraper that fetches roughly 100 small JSON files from the URL below. The problem is that the crawler takes more than an hour to finish. Given how small the JSON files are, I find that hard to understand. Am I doing something fundamentally wrong here?
import json

import requests
from lxml import html


def get_senate_vote(vote):
    # fetch the JSON for a single senate vote
    URL = 'https://www.govtrack.us/data/congress/113/votes/2013/s%d/data.json' % vote
    response = requests.get(URL)
    json_data = json.loads(response.text)
    return json_data


def get_all_votes():
    all_senate_votes = []
    # the directory index lists one <a> per vote
    URL = "http://www.govtrack.us/data/congress/113/votes/2013"
    response = requests.get(URL)
    root = html.fromstring(response.content)
    for a in root.xpath('/html/body/pre/a'):
        link = a.xpath('text()')[0].strip()
        if link[0] == 's':
            vote = int(link[1:-1])
            try:
                vote_json = get_senate_vote(vote)
            except:
                return all_senate_votes
            all_senate_votes.append(vote_json)
    return all_senate_votes


vote_data = get_all_votes()
Answer 0 (score: 1)
Here is a fairly simple version of the code in which I timed each call. On my system each request averages about 2 seconds, and there are 582 pages to fetch, so the whole run takes around 19 minutes without printing the JSON to the console. In your case, network latency plus the time spent printing is probably what pushes it higher.
#!/usr/bin/python
import re
import time

import requests


def find_votes():
    # scrape the directory listing and pull out every senate vote id (s1, s2, ...)
    r = requests.get("https://www.govtrack.us/data/congress/113/votes/2013/")
    data = r.text
    votes = re.findall(r's\d+', data)
    return votes


def crawl_data(votes):
    print("Total pages: " + str(len(votes)))
    for x in votes:
        url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + x + '/data.json'
        t1 = time.time()
        r = requests.get(url)
        json_data = r.json()
        print(time.time() - t1)  # seconds spent on this request


crawl_data(find_votes())
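One thing worth trying before changing the overall approach: reuse a single requests.Session so the HTTPS connection is kept alive between calls instead of being re-established for every vote. A minimal sketch along those lines, assuming the same govtrack URLs as above:

import re
import time

import requests


def crawl_with_session():
    # a single Session reuses the underlying HTTP(S) connection across requests
    with requests.Session() as session:
        listing = session.get("https://www.govtrack.us/data/congress/113/votes/2013/").text
        # dedupe in case a vote id appears more than once in the page (href and link text)
        votes = sorted(set(re.findall(r's\d+', listing)), key=lambda s: int(s[1:]))
        start = time.time()
        data = []
        for v in votes:
            url = 'https://www.govtrack.us/data/congress/113/votes/2013/' + v + '/data.json'
            data.append(session.get(url).json())
        print("%d votes in %.1f seconds" % (len(data), time.time() - start))
        return data

This only removes the per-request connection setup cost; the requests are still sequential, so the bigger win comes from making them concurrent, as the next answer suggests.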
Answer 1 (score: 1)
If you are using Python 3.x and you are scraping many pages, then for better performance I warmly suggest the aiohttp module, which is built around asynchronous I/O.
For example:
import asyncio

import aiohttp

sites = ['url_1', 'url_2']
results = []


def save_response(result):
    # callback fired when a crawl task finishes; collect its body
    site_content = result.result()
    results.append(site_content)


async def crawl_site(site):
    async with aiohttp.ClientSession() as session:
        async with session.get(site) as resp:
            return await resp.text()


tasks = []
for site in sites:
    task = asyncio.ensure_future(crawl_site(site))
    task.add_done_callback(save_response)
    tasks.append(task)

all_tasks = asyncio.gather(*tasks)
loop = asyncio.get_event_loop()
loop.run_until_complete(all_tasks)
loop.close()

print(results)
More information on aiohttp.
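To connect this back to the question, the sites list can be built from the same directory listing the original code parses. This is only a sketch of how the pieces might fit together, assuming the govtrack URL pattern from the question:

import re

import requests

BASE = 'https://www.govtrack.us/data/congress/113/votes/2013'

# build the list of data.json URLs from the directory listing, deduplicating vote ids
listing = requests.get(BASE + '/').text
vote_ids = sorted(set(re.findall(r's\d+', listing)), key=lambda s: int(s[1:]))
sites = [BASE + '/' + v + '/data.json' for v in vote_ids]

With sites filled in this way, the gather/run_until_complete loop above fires all the requests concurrently, so the total runtime is closer to the slowest single response than to the sum of all of them.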