I am scraping images with the Scrapy images pipeline and want to exclude images that match a specific hash from the import.
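import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per image URL on the item.
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Keep only the storage paths of successfully downloaded images.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

# In the spider callback:
item['images'] = response.xpath('//meta[@property="og:image"][not(contains(@content, "Demo_600x600"))]/@content').extract()[0:self.max_pix]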
images:
url "https://www.example.de…212-B726-757P-A20D-1.jpg"
path "full/56de72acb6c1e12ffa8644c1bb96df4edf421438.jpg"
checksum "e206446c40c22cfd5f94966c337b56cc"
How can I make sure this image is not imported?
Answer 0 (score: 1)
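You can try overriding the get_images method of ImagesPipeline. The checksum is computed from the response body, so the image has already been fetched at that point; if the hash matches, the image is discarded instead of being saved.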
import logging
from io import BytesIO

from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.misc import md5sum

logger = logging.getLogger(__name__)


class MyImagesPipeline(ImagesPipeline):
    def get_images(self, response, request, info):
        # MD5 of the raw response body; 'hash1' and 'hash2' are placeholders.
        checksum = md5sum(BytesIO(response.body))
        drop_list = ['hash1', 'hash2']
        logger.debug('Verifying Checksum: {}'.format(checksum))
        if checksum in drop_list:
            logger.debug('Skipping Checksum: {}'.format(checksum))
            # Raising here makes this image show up as a failed entry in results.
            raise Exception('Dropping Image')
        return super(MyImagesPipeline, self).get_images(response, request, info)
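
As an alternative, the filtering could also happen in item_completed, since each successful entry in results carries a checksum (as in the item output shown in the question). The following is only a minimal sketch in plain Python, not something taken from the Scrapy API: the helper names keep_paths and md5_of_bytes, the DROP_CHECKSUMS set, and the example.invalid URLs are made up for illustration. One caveat to verify in your own setup: the checksum the pipeline reports may differ from the MD5 of the raw response body if the pipeline re-encodes the image, so it is safest to take drop-list values from the same place you compare them against.

```python
import hashlib

# Hypothetical drop list: MD5 checksums of images that should be ignored.
DROP_CHECKSUMS = {"e206446c40c22cfd5f94966c337b56cc"}


def keep_paths(results, drop_checksums=DROP_CHECKSUMS):
    """Return paths of successful downloads whose checksum is not in the drop list."""
    return [
        info["path"]
        for ok, info in results
        if ok and info.get("checksum") not in drop_checksums
    ]


def md5_of_bytes(data):
    """MD5 hex digest of raw bytes, e.g. to build drop-list entries from known files."""
    return hashlib.md5(data).hexdigest()


# Example data shaped like the `results` argument of item_completed.
results = [
    (True, {"url": "https://example.invalid/a.jpg", "path": "full/a.jpg",
            "checksum": "e206446c40c22cfd5f94966c337b56cc"}),
    (True, {"url": "https://example.invalid/b.jpg", "path": "full/b.jpg",
            "checksum": md5_of_bytes(b"other image bytes")}),
    (False, Exception("download failed")),
]
print(keep_paths(results))  # prints ['full/b.jpg']
```

Inside item_completed, this would replace the list comprehension, for example image_paths = keep_paths(results). Note that by the time item_completed runs the files have already been written to the image store, so this variant only keeps unwanted images out of the item; overriding get_images as above avoids storing them in the first place.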