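I have a working spider that scrapes image URLs and puts them in the image_urls field of a scrapy.Item. I have a custom pipeline that inherits from ImagesPipeline. Some URLs return a non-200 HTTP response code (say, a 401 error). For example, in the log file I find: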
WARNING:scrapy.pipelines.files:File (code: 404): Error downloading file from <GET http://a.espncdn.com/combiner/i%3Fimg%3D/i/headshots/tennis/players/full/425.png> referred in <None>
WARNING:scrapy.pipelines.files:File (code: 307): Error downloading file from <GET http://www.fansshare.com/photos/rogerfederer/federer-roger-federer-406468306.jpg> referred in <None>
However, I cannot capture the error codes (404, 307, etc.) in the item_completed() function of my custom image pipeline:
    def item_completed(self, results, item, info):
        image_paths = []
        for download_status, x in results:
            if download_status:
                image_paths.append(x['path'])
                item['images'] = image_paths  # update item image path
                item['result_download_status'] = 1
            else:
                item['result_download_status'] = 0
                # x.printDetailedTraceback()
                logging.info(repr(x))  # x is a twisted failure object
        return item
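Digging through the Scrapy source code, inside the media_downloaded() function in files.py, I found that for non-200 response codes a warning is logged (which explains the WARNING lines above) and then a FileException is raised.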
    if response.status != 200:
        logger.warning(
            'File (code: %(status)s): Error downloading file from '
            '%(request)s referred in <%(referer)s>',
            {'status': response.status,
             'request': request, 'referer': referer},
            extra={'spider': info.spider}
        )
        raise FileException('download-error')
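To illustrate why the code is lost, here is a minimal sketch that does not import Scrapy. The FileException class and the fake_media_downloaded() and describe_failure() helpers are hypothetical stand-ins written only for this example; they mimic the branch quoted above, where the status is logged and then only a fixed message is raised.

```python
class FileException(Exception):
    """Stand-in for scrapy.pipelines.files.FileException (illustration only)."""


def fake_media_downloaded(status):
    # Mirrors the non-200 branch quoted above: the status is dropped
    # and only the fixed message 'download-error' is raised.
    if status != 200:
        raise FileException('download-error')
    return {'path': 'full/example.jpg'}  # placeholder result, not real Scrapy output


def describe_failure(status):
    """Return (message, has_status_attribute) for a failed download, else None."""
    try:
        fake_media_downloaded(status)
    except FileException as exc:
        # Only the message survives; there is no status attribute to read.
        return str(exc), hasattr(exc, 'status')
    return None


print(describe_failure(404))
```

So whatever reaches item_completed() wrapped in a Twisted failure has no response code attached, which matches what I see in repr(x).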
How can I access this response code so that I can handle it in my pipeline's item_completed() function?
Answer 0: (score: 1)
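If you are not familiar with asynchronous programming and with Twisted callbacks and errbacks, it is easy to get confused by all the methods chained together in Scrapy's media pipelines. In your case, the basic idea is to override media_downloaded so that it handles non-200 responses itself, like this (just a quick and dirty PoC):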
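    from scrapy.pipelines.images import ImagesPipeline

    class MyPipeline(ImagesPipeline):

        def media_downloaded(self, response, request, info):
            if response.status != 200:
                # Return a plain dict instead of raising, so the code reaches item_completed()
                return {'url': request.url, 'status': response.status}
            return super(MyPipeline, self).media_downloaded(response, request, info)

        def item_completed(self, results, item, info):
            image_paths = []
            for download_status, x in results:
                if download_status:
                    if not x.get('status', False):
                        # Successful download
                        image_paths.append(x['path'])
                    else:
                        # x['status'] contains non-200 response code
                        item['result_download_status'] = x['status']
            item['images'] = image_paths
            return item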
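The branching in item_completed() can be tried without running a crawl. The sketch below is only an illustration: split_results() is a hypothetical helper, and the sample results list is hand-written to imitate the (success, value) pairs the pipeline passes in. I believe newer Scrapy releases also add a string 'status' entry (such as 'downloaded') to successful result dicts, so this sketch treats only integer values as HTTP codes; verify that against your installed version.

```python
def split_results(results):
    """Split item_completed()-style results into paths and HTTP error codes.

    Assumes each entry is (success, value), where value is a dict with 'path'
    for a real download, or a dict with an integer 'status' when the
    overridden media_downloaded() returned the non-200 marker.
    """
    paths = []
    error_codes = {}
    for success, value in results:
        if not success:
            continue  # value is a Twisted Failure here, not a dict
        status = value.get('status')
        if isinstance(status, int) and status != 200:
            error_codes[value['url']] = status
        else:
            paths.append(value['path'])
    return paths, error_codes


# Hand-written sample data, not real Scrapy output.
sample = [
    (True, {'url': 'http://example.com/a.png', 'path': 'full/a.png', 'checksum': 'abc'}),
    (True, {'url': 'http://example.com/b.png', 'status': 404}),
]
paths, error_codes = split_results(sample)
print(paths, error_codes)
```

Keeping the codes in a dict keyed by URL is just one choice; with several images per item, a single result_download_status field can only hold the last code seen.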
Answer 1: (score: 0)
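The correct way to capture non-200 response codes seems to be to override media_downloaded, but call the parent function and catch the exception. Here is code that works: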
    # Needs: import logging; from scrapy.pipelines.files import FileException
    def media_downloaded(self, response, request, info):
        try:
            resultdict = super(MyPipeline, self).media_downloaded(response, request, info)
            resultdict['status'] = response.status
            logging.warning('No Exception : {}'.format(response.status))
            return resultdict
        except FileException as exc:
            # The parent raises FileException for non-200 responses; keep the code instead
            logging.warning('Caused Exception : {} {}'.format(response.status, str(exc)))
            return {'url': request.url, 'status': response.status}
The response code can then be handled in item_completed():
    def item_completed(self, results, item, info):
        image_paths = []
        for download_status, x in results:
            # x is a dict only when download_status is True; otherwise it is a Twisted Failure
            if download_status and 'status' in x:
                item['result_download_status'] = x['status']  # may be a non-200 response code
                if x['status'] == 200:
                    image_paths.append(x['path'])
        item['images'] = image_paths  # update item image path
        return item
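The try/except pattern above can be checked in isolation. The following is a rough sketch that does not import Scrapy: FileException, FakeResponse, FakeRequest and BasePipeline are hypothetical stand-ins that only imitate the non-200 branch quoted in the question, and the returned 'path' and 'checksum' values are placeholders.

```python
class FileException(Exception):
    """Stand-in for scrapy.pipelines.files.FileException (illustration only)."""


class FakeResponse:
    def __init__(self, status):
        self.status = status


class FakeRequest:
    def __init__(self, url):
        self.url = url


class BasePipeline:
    """Stand-in for the parent pipeline: raises on any non-200 response."""

    def media_downloaded(self, response, request, info):
        if response.status != 200:
            raise FileException('download-error')
        return {'url': request.url, 'path': 'full/fake.jpg', 'checksum': 'fake'}


class MyPipeline(BasePipeline):
    def media_downloaded(self, response, request, info):
        try:
            result = super().media_downloaded(response, request, info)
            result['status'] = response.status
            return result
        except FileException:
            # Convert the exception into a dict that still carries the code.
            return {'url': request.url, 'status': response.status}


pipeline = MyPipeline()
ok = pipeline.media_downloaded(FakeResponse(200), FakeRequest('http://example.com/a.jpg'), None)
bad = pipeline.media_downloaded(FakeResponse(404), FakeRequest('http://example.com/b.jpg'), None)
print(ok['status'], bad['status'])
```

Because the exception is swallowed, the pipeline reports these entries as successful, so item_completed() has to tell them apart by the 'status' value rather than by download_status.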