I have some custom middlewares to capture failed URLs:
DOWNLOADER_MIDDLEWARES = {
    'soufang.misc.middleware.CustomRecordMiddleware': 860,
    'soufang.misc.middleware.CustomFailresponseMiddleware': 298,
    'soufang.misc.middleware.CustomHttpProxyMiddleware': 297,
}
CustomRecordMiddleware sits very close to the downloader, so it can capture all the exceptions that the default DownloaderStats middleware counts. CustomHttpProxyMiddleware and CustomFailresponseMiddleware capture the URLs and exceptions that still fail after retries.
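To make the ordering concrete, this is how I understand Scrapy's downloader middleware ordering (the names below are placeholders, not middlewares from my project):

# Lower numbers sit closer to the engine, higher numbers closer to the downloader.
# process_request() is called in ascending order (engine -> downloader), while
# process_response() and process_exception() are called in descending order
# (downloader -> engine). So CustomRecordMiddleware at 860 should see an exception
# before the built-in DownloaderStats does, and the middlewares at 298/297 should
# only see requests that the built-in RetryMiddleware has already given up on.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middleware.NearEngineMiddleware': 100,      # placeholder
    'myproject.middleware.NearDownloaderMiddleware': 900,  # placeholder
}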
Here is middleware.py:
from agents import AGENTS
from usefulproxy350 import PROXIES
from scrapy import log
import random


class CustomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request.
        agent = random.choice(AGENTS)
        request.headers['User-Agent'] = agent


class CustomHttpProxyMiddleware(object):
    def process_request(self, request, spider):
        # Also randomize the User-Agent, then attach a random proxy.
        agent = random.choice(AGENTS)
        request.headers['User-Agent'] = agent
        p = random.choice(PROXIES)
        try:
            request.meta['proxy'] = "http://%s" % p
        except Exception, e:
            log.msg("Exception %s" % e, _level=log.CRITICAL)

    def process_exception(self, request, exception, spider):
        # Record the URL and proxy of a request that raised an exception.
        url = request.url
        proxy = request.meta['proxy']
        myfile = open('outurl_excep.txt', 'a')
        myfile.write(url + '\n')
        myfile.write(proxy + '\n')
        myfile.close()


class CustomFailresponseMiddleware(object):
    def process_response(self, request, response, spider):
        try:
            if response.status != 200 or len(response.headers) == 0:
                # Record the bad response and return the request so that
                # it gets rescheduled for another download attempt.
                myfile = open('outurl_respo.txt', 'a')
                myfile.write(response.url + '\n')
                myfile.close()
                return request
            return response
        except Exception, e:
            log.msg("Response Exception %s" % e)


class CustomRecordMiddleware(object):
    def process_exception(self, request, exception, spider):
        # Record every exception that reaches this middleware.
        url = request.url
        proxy = request.meta['proxy']
        myfile = open('outurl_record.txt', 'a')
        myfile.write(url + '\n')
        myfile.write(proxy + '\n')
        myfile.close()
        log.msg('Fail to request url %s with exception %s' % (url, str(exception)))
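For reference, I know download failures can also be caught with a per-request errback; the sketch below is only an illustration of that idea (the spider, URL, and output file are made up, not part of my project):

# Illustrative only: recording download failures with a per-request errback.
import scrapy

class ErrbackDemoSpider(scrapy.Spider):
    name = 'errback_demo'  # hypothetical spider

    def start_requests(self):
        yield scrapy.Request('http://example.com/page/1',
                             callback=self.parse_page,
                             errback=self.handle_failure)

    def parse_page(self, response):
        pass

    def handle_failure(self, failure):
        # failure is a twisted Failure; for download errors the original
        # request is normally available as failure.request.
        with open('outurl_errback.txt', 'a') as f:  # hypothetical file
            f.write(failure.request.url + '\n')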
It seems there are still some failed URLs that I did not catch. I am crawling 51 pages, but the spider seems to stop after page 24.
Here are the stats:
2015-05-27 13:04:15+0800 [soufang_redis] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 55,
'downloader/exception_type_count/twisted.internet.error.ConnectError': 6,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 1,
'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 18,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 9,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 21,
'downloader/request_bytes': 230985,
'downloader/request_count': 582,
'downloader/request_method_count/GET': 582,
'downloader/response_bytes': 8174486,
'downloader/response_count': 527,
'downloader/response_status_count/200': 505,
'downloader/response_status_count/400': 1,
'downloader/response_status_count/404': 4,
'downloader/response_status_count/502': 10,
'downloader/response_status_count/503': 7,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2015, 5, 27, 5, 4, 15, 945815),
'item_dropped_count': 5,
'item_dropped_reasons_count/DropItem': 5,
'item_scraped_count': 475,
'log_count/INFO': 82,
'log_count/WARNING': 5,
'request_depth_max': 24,
'response_received_count': 505,
'scheduler/dequeued/redis': 582,
'scheduler/enqueued/redis': 582,
'start_time': datetime.datetime(2015, 5, 27, 4, 47, 13, 889437)}
I checked outurl_record.txt: 55 exceptions are recorded, which exactly equals downloader/exception_count (and the downloader numbers add up: 527 responses + 55 exceptions = 582 requests, so every request ended in either a response or a recorded exception). However, request_depth_max is only 24 (it should be 51), and I cannot find any failure record for page 25 in outurl_record.txt, nor in outurl_excep.txt or outurl_respo.txt.
I have tried this several times; sometimes it crawls all the pages and sometimes it does not.
What am I missing?