How to get failed URLs with Scrapy

Time: 2015-05-27 06:28:11

Tags: python web-scraping scrapy scrapy-spider

I have two custom middlewares to capture failed URLs:

DOWNLOADER_MIDDLEWARES={
    'soufang.misc.middleware.CustomRecordMiddleware':860,
    'soufang.misc.middleware.CustomFailresponseMiddleware':298,
    'soufang.misc.middleware.CustomHttpProxyMiddleware':297,
}
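
For context, here is my understanding of the relevant built-in priorities (an excerpt of what I believe DOWNLOADER_MIDDLEWARES_BASE contains in my Scrapy version; the exact numbers and module paths may differ in other versions):

    # Approximate built-in priorities -- a higher number means closer to the
    # downloader, so process_exception runs there first when a download fails.
    # RetryMiddleware (500) can swallow an exception by returning a retried request.
    {
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
    }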

CustomRecordMiddleware is placed very close to the downloader, so it can capture all the exceptions counted by the default DownloaderStats middleware.

CustomHttpProxyMiddleware and CustomFailresponseMiddleware capture the URLs and exceptions that still fail after retrying.

Here is middleware.py:

from agents import AGENTS
from usefulproxy350 import PROXIES
from scrapy import log
import random

class CustomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        agent = random.choice(AGENTS)
        request.headers['User-Agent'] = agent

class CustomHttpProxyMiddleware(object):

    def process_request(self, request, spider):
        # rotate the user agent and the proxy for every outgoing request
        agent = random.choice(AGENTS)
        request.headers['User-Agent'] = agent
        p = random.choice(PROXIES)
        try:
            request.meta['proxy'] = "http://%s" % p
        except Exception as e:
            log.msg("Exception %s" % e, level=log.CRITICAL)

    def process_exception(self, request, exception, spider):
        # at priority 297 this only fires for exceptions that survived the retries
        url = request.url
        proxy = request.meta.get('proxy', '')
        with open('outurl_excep.txt', 'a') as myfile:
            myfile.write(url + '\n')
            myfile.write(proxy + '\n')

class CustomFailresponseMiddleware(object):

    def process_response(self, request, response, spider):
        try:
            if response.status != 200 or len(response.headers) == 0:
                # log the bad response and re-schedule the original request
                with open('outurl_respo.txt', 'a') as myfile:
                    myfile.write(response.url + '\n')
                return request
            return response
        except Exception as e:
            log.msg("Response Exception %s" % e)
            return response  # process_response must return a Response or Request

class CustomRecordMiddleware(object):

    def process_exception(self, request, exception, spider):
        # at priority 860 this runs before RetryMiddleware, so every raw
        # downloader exception should land here
        url = request.url
        proxy = request.meta.get('proxy', '')
        with open('outurl_record.txt', 'a') as myfile:
            myfile.write(url + '\n')
            myfile.write(proxy + '\n')
        log.msg('Fail to request url %s with exception %s' % (url, str(exception)))
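
For reference, the errback-based alternative I have seen suggested looks roughly like this (a minimal sketch with placeholder names -- ErrbackSketchSpider, the start URL and handle_failure are not my real code); my question is still about the middleware approach above:

    import scrapy
    from scrapy import log

    class ErrbackSketchSpider(scrapy.Spider):
        name = 'errback_sketch'

        def start_requests(self):
            # placeholder URL; the real spider builds these from the listing pages
            yield scrapy.Request('http://example.com/page/1',
                                 callback=self.parse,
                                 errback=self.handle_failure)

        def parse(self, response):
            pass

        def handle_failure(self, failure):
            # failure is a twisted Failure; for download errors the original
            # request is normally available as failure.request
            request = getattr(failure, 'request', None)
            if request is not None:
                with open('outurl_errback.txt', 'a') as f:
                    f.write(request.url + '\n')
                log.msg('Request failed: %s (%s)' % (request.url, failure.value))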

It still seems that some failed URLs are not being caught. When I crawl 51 pages, the spider seems to stop after page 24.

Here are the stats:

2015-05-27 13:04:15+0800 [soufang_redis] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 55,
     'downloader/exception_type_count/twisted.internet.error.ConnectError': 6,
     'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 1,
     'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 18,
     'downloader/exception_type_count/twisted.internet.error.TimeoutError': 9,
     'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 21,
     'downloader/request_bytes': 230985,
     'downloader/request_count': 582,
     'downloader/request_method_count/GET': 582,
     'downloader/response_bytes': 8174486,
     'downloader/response_count': 527,
     'downloader/response_status_count/200': 505,
     'downloader/response_status_count/400': 1,
     'downloader/response_status_count/404': 4,
     'downloader/response_status_count/502': 10,
     'downloader/response_status_count/503': 7,
     'finish_reason': 'shutdown',
     'finish_time': datetime.datetime(2015, 5, 27, 5, 4, 15, 945815),
     'item_dropped_count': 5,
     'item_dropped_reasons_count/DropItem': 5,
     'item_scraped_count': 475,
     'log_count/INFO': 82,
     'log_count/WARNING': 5,
     'request_depth_max': 24,
     'response_received_count': 505,
     'scheduler/dequeued/redis': 582,
     'scheduler/enqueued/redis': 582,
     'start_time': datetime.datetime(2015, 5, 27, 4, 47, 13, 889437)}

I checked outurl_record.txt: the number of recorded exceptions is 55, which exactly matches downloader/exception_count. But request_depth_max is only 24 (it should be 51), and I cannot find any failure record for page 25 in outurl_record.txt, nor in outurl_excep.txt or outurl_respo.txt.
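
The cross-check itself was just a quick throwaway helper (not part of the project) that walks the url/proxy pairs written by the record middleware:

    # outurl_record.txt stores a URL line followed by a proxy line for each failure
    def count_recorded_failures(path='outurl_record.txt'):
        with open(path) as f:
            lines = [line.rstrip('\n') for line in f]
        urls = lines[0::2]  # every first line of a (url, proxy) pair
        print('recorded failures: %d, distinct urls: %d' % (len(urls), len(set(urls))))

    count_recorded_failures()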

I have tried several times; sometimes it crawls all the pages and sometimes it does not.

What am I missing?

0 Answers:

There are no answers yet.