Scrapy Tumblr scraper does not save images

Time: 2015-10-17 16:51:56

Tags: python image web-crawler scrapy

I am using Scrapy to scrape images from Tumblr. The scraper seems to extract the image URLs, but it does not download the images.

settings.py

BOT_NAME = 'tumblr'

SPIDER_MODULES = ['tumblr.spiders']
NEWSPIDER_MODULE = 'tumblr.spiders'
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = r'C:\Users\123\Desktop'  # raw string, so '\123' is not read as an octal escape

items.py

import scrapy


class TumblrItem(scrapy.Item):

   image_urls = scrapy.Field()
   images = scrapy.Field()

tumblr_spider

import scrapy

from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from tumblr.items import TumblrItem

class TumblrSpider(CrawlSpider):
    name = 'tumblr'
    allowed_domains = ['tumblr.com']
    start_urls = ['http://free-indie-games.tumblr.com/archive']
    rules = [Rule(LinkExtractor(allow=['/post']), 'parse_imgur')]

    def parse_imgur(self, response):
        image = TumblrItem()

        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:'+rel[0]]
        return image

Log (it is long, so I will just put it here)

2015-10-17 17:43:59 [scrapy] DEBUG: Scraped from <200 http://free-indie-games.tumblr.com/post/63142153501>
    {'image_urls': [u'http:http://38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png'],
     'images': []}
2015-10-17 17:44:00 [scrapy] INFO: Closing spider (finished)
2015-10-17 17:44:00 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.internet.error.ConnectError': 3,
 'downloader/request_bytes': 8356,
 'downloader/request_count': 29,
 'downloader/request_method_count/GET': 29,
 'downloader/response_bytes': 295766,
 'downloader/response_count': 26,
 'downloader/response_status_count/200': 26,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 17, 16, 44, 0, 951000),
 'item_scraped_count': 25,
 'log_count/DEBUG': 55,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'request_depth_max': 1,
 'response_received_count': 26,
 'scheduler/dequeued': 26,
 'scheduler/dequeued/memory': 26,
 'scheduler/enqueued': 26,
 'scheduler/enqueued/memory': 26,
 'start_time': datetime.datetime(2015, 10, 17, 16, 43, 58, 83000)}
2015-10-17 17:44:00 [scrapy] INFO: Spider closed (finished)

To me it looks like it scrapes the URLs but does not download the images. At least nothing shows up on my computer.

Any ideas?

1 answer:

Answer 0 (score: 0)

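The extracted image URL does not look right:

http:http://38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png

2015-10-17 17:43:59 [scrapy] DEBUG: Scraped from <200 http://free-indie-games.tumblr.com/post/63142153501> {'image_urls': [u'http:http://38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png'], 'images': []}

The src attribute on this page is already an absolute URL, so prepending 'http:' to it produces a doubled scheme.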
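To see why a downloader would struggle with such a URL, it can help to parse it. The following is a minimal sketch using only the standard library (Python 3 import shown; `bad_url` and `good_url` are illustrative names, not part of the original spider). It suggests that the doubled scheme leaves the parsed URL without a host:

```python
from urllib.parse import urlparse

# URL as it appears in the scraped item: 'http:' was prepended to a src
# attribute that was already an absolute URL.
bad_url = 'http:' + 'http://38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png'
good_url = 'http://38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png'

bad = urlparse(bad_url)
good = urlparse(good_url)

# The first 'http:' is consumed as the scheme. What remains does not start
# with '//', so it is parsed as a path and no host is found.
print(repr(bad.netloc))   # host part of the malformed URL
print(repr(good.netloc))  # host part of the well-formed URL
```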

To be safe, try using urljoin to build the correct URL instead of doing string manipulation:

 from urlparse import urljoin  # Python 2; on Python 3: from urllib.parse import urljoin

 rel = response.xpath("//img/@src").extract()
 image['image_urls'] = [urljoin(response.url, rel[0])]
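As a rough illustration of why urljoin is safer than concatenation, the sketch below resolves three kinds of src value against the page URL. It uses the Python 3 import; `absolute_image_urls` is a hypothetical helper and `/images/example.png` is a made-up path, neither of which comes from the original spider:

```python
from urllib.parse import urljoin  # Python 3; Python 2 has urlparse.urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve every img src against the page URL (hypothetical helper)."""
    return [urljoin(page_url, src) for src in srcs]


page = 'http://free-indie-games.tumblr.com/post/63142153501'
srcs = [
    'http://38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png',  # already absolute
    '//38.media.tumblr.com/avatar_0c4d1dcedfcd_128.png',       # protocol-relative
    '/images/example.png',                                     # site-relative (made up)
]

resolved = absolute_image_urls(page, srcs)
for url in resolved:
    print(url)
```

In the spider, something like `image['image_urls'] = absolute_image_urls(response.url, rel)` would pass every image on the page to the pipeline instead of only `rel[0]`, which in the log above is the blog avatar. Scrapy 1.0 and later also provide `response.urljoin(url)`, which resolves against the response URL without a separate import.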