How can I use Scrapy to download images from dynamically generated, hashed URLs?

Asked: 2017-01-24 03:16:33

Tags: scrapy scrapy-spider scrapy-pipeline

I am using Scrapy to download images from the website https://pixabay.com/. My working code is as follows:

from scrapy.spiders import Spider
from website.imageItems import imageItems

class imageSpider(Spider):
    name = "imageCrawler"
    start_urls = ['https://pixabay.com/en/goose-bird-isolated-feather-1988657/']

    def parse(self, response):
        # Take the src of the preview image shown on the photo page.
        image_url = response.xpath('//div[@id="media_container"]/img/@src').extract_first()
        # The images pipeline expects a list of URLs in image_urls.
        yield imageItems(image_urls=[image_url])

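With this code I can download the image https://cdn.pixabay.com/photo/2017/01/18/01/07/goose-1988657_960_720.png without any problem. However, when I modify the code to download a larger size of the same image, it stops working: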

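def parse(self, response):
    # Read the file name from the input value in the "no_default" table row.
    filename = response.xpath('//tr[@class="no_default"]/td/input/@value').extract_first()
    # Prefix it with the site's download path to get the full-size image URL.
    image_url = 'https://pixabay.com/en/photos/download/' + filename
    yield imageItems(image_urls=[image_url])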
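
A side note on the snippet above: the string concatenation raises a TypeError if `extract_first()` returns None, which happens whenever the XPath matches nothing. A minimal sketch of a guard, assuming a hypothetical helper named `build_download_url` (not part of Scrapy) and the download prefix used in the question:

```python
from typing import Optional
from urllib.parse import urljoin

# Download prefix taken from the question; treat it as an assumption about the site.
DOWNLOAD_BASE = "https://pixabay.com/en/photos/download/"


def build_download_url(filename: Optional[str], base: str = DOWNLOAD_BASE) -> Optional[str]:
    """Join the extracted file name onto the download prefix.

    Returns None when nothing was extracted, so the caller can skip the item
    instead of failing on None + str.
    """
    if not filename:
        return None
    return urljoin(base, filename.strip())
```

In `parse`, the item would then only be yielded when the helper returns a URL.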

In the code above, the image URL is:

https://pixabay.com/en/photos/download/goose-bird-isolated-feather-1988657.png

But the server converts that URL into a hashed URL such as https://pixabay.com/get/e83cb9072ef1063ecd1f4107ee4d4697e16ae3d111b4134392f3c27e/goose-1988657.png

Because of the hashed URL, my Scrapy code does not work. The error:

2017-01-24 08:25:22 [scrapy] DEBUG: Crawled (200) <GET https://pixabay.com/en/photos/download/goose-1988657.png> (referer: None)
2017-01-24 08:25:22 [scrapy] DEBUG: File (downloaded): Downloaded file from <GET https://pixabay.com/en/photos/download/goose-1988657.png> referred in <None>
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing BmpImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing BufrStubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing CurImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing DcxImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing DdsImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing EpsImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FitsStubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FliImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FpxImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing FtexImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing GbrImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing GifImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing GribStubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing Hdf5StubImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing IcnsImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing IcoImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing ImImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing ImtImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing IptcImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing JpegImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing Jpeg2KImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing McIdasImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MicImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MpegImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MpoImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing MspImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PalmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PcdImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PcxImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PdfImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PixarImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PngImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PpmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing PsdImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing SgiImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing SpiderImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing SunImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing TgaImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing TiffImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing WebPImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing WmfImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing XbmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing XpmImagePlugin
2017-01-24 08:25:22 [PIL.Image] DEBUG: Importing XVThumbImagePlugin
2017-01-24 08:25:22 [scrapy] ERROR: File (unknown-error): Error processing file from <GET https://pixabay.com/en/photos/download/goose-1988657.png> referred in <None>
Traceback (most recent call last):
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 1185, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 1162, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 https://pixabay.com/en/photos/download/goose-1988657.png>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\files.py", line 339, in media_downloaded
    checksum = self.file_downloaded(response, request, info)
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 64, in file_downloaded
    return self.image_downloaded(response, request, info)
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 68, in image_downloaded
    for path, image, buf in self.get_images(response, request, info):
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\scrapy\pipelines\images.py", line 81, in get_images
    orig_image = Image.open(BytesIO(response.body))
  File "C:\Users\maneesh_patel\Miniconda3\lib\site-packages\PIL\Image.py", line 2349, in open
    % (filename if filename else fp))
OSError: cannot identify image file <_io.BytesIO object at 0x000001D25008A888>
2017-01-24 08:25:22 [scrapy] WARNING: Dropped: Item contains no images
{'image_urls': ['https://pixabay.com/en/photos/download/goose-1988657.png']}
2017-01-24 08:25:22 [scrapy] INFO: Closing spider (finished)
2017-01-24 08:25:22 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 574,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 9486,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'file_count': 1,
 'file_status_count/downloaded': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 1, 24, 2, 55, 22, 780851),
 'item_dropped_count': 1,
 'item_dropped_reasons_count/DropItem': 1,
 'log_count/DEBUG': 48,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'log_count/WARNING': 2,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 1, 24, 2, 55, 20, 983794)}
2017-01-24 08:25:22 [scrapy] INFO: Spider closed (finished)
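
A note on reading this log: the download URL is fetched with status 200, yet PIL cannot identify the body, which suggests the response was not image data (an HTML page is one possibility), although the log alone does not show what it actually was. A minimal sketch for checking that, assuming hypothetical helper names `sniff_image_type` and `describe_body` (neither is part of Scrapy or PIL) and covering only a few common formats:

```python
from typing import Optional

# Leading "magic" bytes for a few common image formats (not exhaustive).
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(body: bytes) -> Optional[str]:
    """Return a format name if body starts with a known signature, else None."""
    for signature, name in _SIGNATURES.items():
        if body.startswith(signature):
            return name
    return None


def describe_body(body: bytes, preview: int = 60) -> str:
    """Summarise a response body for logging while debugging a download."""
    kind = sniff_image_type(body)
    if kind is not None:
        return "looks like a %s image (%d bytes)" % (kind, len(body))
    # Show the first bytes so an HTML or error page is easy to recognise.
    return "not a recognised image; starts with %r" % body[:preview]
```

Logging `describe_body(response.body)` in a callback that requests the download URL would show whether the server returned an image or something else.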

This problem is not specific to this site: Scrapy fails every time a server generates a dynamic URL for an image. Has anyone run into the same problem?

0 answers:

No answers yet