Scraping XKCD images with Scrapy

Asked: 2014-12-18 09:16:59

Tags: python image python-2.7 web-scraping scrapy

I'm trying to scrape xkcd.com to retrieve all of the images they have available. When I run my scraper, it downloads 7-8 seemingly random images from the range www.xkcd.com/1-1461. I want it to work through every page sequentially and save each image, so that I end up with a complete set.

Below is the crawl spider I wrote, along with the output I get from Scrapy:

Spider:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from xkcd.items import XkcdItem

class XkcdimagesSpider(CrawlSpider):
    name = "xkcdimages"
    allowed_domains = ["xkcd.com"]
    start_urls = ['http://www.xkcd.com']
    rules = [Rule(LinkExtractor(allow=['\d+']), 'parse_xkcd')]

    def parse_xkcd(self, response):
        image = XkcdItem()
        image['title'] = response.xpath(
            "//div[@id='ctitle']/text()").extract()
        image['image_urls'] = response.xpath(
            "//div[@id='comic']/img/@src").extract()
        return image

Output:

2014-12-18 19:57:42+1300 [scrapy] INFO: Scrapy 0.24.4 started (bot: xkcd)
2014-12-18 19:57:42+1300 [scrapy] INFO: Optional features available: ssl, http11, django
2014-12-18 19:57:42+1300 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xkcd.spiders', 'SPIDER_MODULES': ['xkcd.spiders'], 'DOWNLOAD_DELAY': 1, 'BOT_NAME': 'xkcd'}
2014-12-18 19:57:42+1300 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2014-12-18 19:57:43+1300 [xkcdimages] INFO: Spider opened
2014-12-18 19:57:43+1300 [xkcdimages] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-18 19:57:43+1300 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-18 19:57:43+1300 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com> (referer: None)
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Filtered offsite request to 'creativecommons.org': <GET http://creativecommons.org/licenses/by-nc/2.5/>
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://xkcd.com/1461/large/> (referer: http://www.xkcd.com)
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Scraped from <200 http://xkcd.com/1461/large/>
    {'image_urls': [], 'images': [], 'title': []}
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/1/> (referer: http://www.xkcd.com)
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg> referred in <None>
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/1/>
    {'image_urls': [u'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'],
     'images': [{'checksum': '953bf3bf4584c2e347eaaba9e4703c9d',
                 'path': 'full/ab31199b91c967a29443df3093fac9c97e5bbed6.jpg',
                 'url': 'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'}],
     'title': [u'Barrel - Part 1']}
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/556/> (referer: http://www.xkcd.com)
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg> referred in <None>
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/556/>
    {'image_urls': [u'http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg'],
     'images': [{'checksum': 'c88a6e5a3018bce48861bfe2a2255993',
                 'path': 'full/b523e12519a1735f1d2c10cb8b803e0a39bf90e5.jpg',
                 'url': 'http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg'}],
     'title': [u'Alternative Energy Revolution']}
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/688/> (referer: http://www.xkcd.com)
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/self_description.png> referred in <None>
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/688/>
    {'image_urls': [u'http://imgs.xkcd.com/comics/self_description.png'],
     'images': [{'checksum': '230b38d12d5650283dc1cc8a7f81469b',
                 'path': 'full/e754ff4560918342bde8f2655ff15043e251f25a.jpg',
                 'url': 'http://imgs.xkcd.com/comics/self_description.png'}],
     'title': [u'Self-Description']}
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/162/> (referer: http://www.xkcd.com)
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/angular_momentum.jpg> referred in <None>
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/162/>
    {'image_urls': [u'http://imgs.xkcd.com/comics/angular_momentum.jpg'],
     'images': [{'checksum': '83050c0cc9f4ff271a9aaf52372aeb33',
                 'path': 'full/7c180399f2a2cffeb321c071dea2c669d83ca328.jpg',
                 'url': 'http://imgs.xkcd.com/comics/angular_momentum.jpg'}],
     'title': [u'Angular Momentum']}
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/730/> (referer: http://www.xkcd.com)
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/circuit_diagram.png> referred in <None>
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/730/>
    {'image_urls': [u'http://imgs.xkcd.com/comics/circuit_diagram.png'],
     'images': [{'checksum': 'd929f36d6981cb2825b25c9a8dac7c9e',
                 'path': 'full/15ad254b5cd5c506d701be67f525093af79e6ac0.jpg',
                 'url': 'http://imgs.xkcd.com/comics/circuit_diagram.png'}],
     'title': [u'Circuit Diagram']}
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/150/> (referer: http://www.xkcd.com)
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/grownups.png> referred in <None>
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/150/>
    {'image_urls': [u'http://imgs.xkcd.com/comics/grownups.png'],
     'images': [{'checksum': '9d165fd0b00ec88bcc953da19d52a3d3',
                 'path': 'full/57fdec7b0d3b2c0a146ea77937c776994f631a4a.jpg',
                 'url': 'http://imgs.xkcd.com/comics/grownups.png'}],
     'title': [u'Grownups']}
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/1460/> (referer: http://www.xkcd.com)
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/smfw.png> referred in <None>
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/1460/>
    {'image_urls': [u'http://imgs.xkcd.com/comics/smfw.png'],
     'images': [{'checksum': '705b029ffbdb7f2306ccb593426392fd',
                 'path': 'full/93805911ad95e7f5c2f93a6873a2ae36c0d00f86.jpg',
                 'url': 'http://imgs.xkcd.com/comics/smfw.png'}],
     'title': [u'SMFW']}
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Closing spider (finished)
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2173,
     'downloader/request_count': 9,
     'downloader/request_method_count/GET': 9,
     'downloader/response_bytes': 26587,
     'downloader/response_count': 9,
     'downloader/response_status_count/200': 9,
     'file_count': 7,
     'file_status_count/uptodate': 7,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 12, 18, 6, 57, 52, 133428),
     'item_scraped_count': 8,
     'log_count/DEBUG': 27,
     'log_count/INFO': 7,
     'offsite/domains': 1,
     'offsite/filtered': 1,
     'request_depth_max': 1,
     'response_received_count': 9,
     'scheduler/dequeued': 9,
     'scheduler/dequeued/memory': 9,
     'scheduler/enqueued': 9,
     'scheduler/enqueued/memory': 9,
     'start_time': datetime.datetime(2014, 12, 18, 6, 57, 43, 153440)}
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Spider closed (finished)
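
For context, the "Overridden settings" line in the log above corresponds to a `settings.py` along these lines (a sketch: the `ITEM_PIPELINES` entry and `IMAGES_STORE` path are assumptions consistent with the Scrapy 0.24-era `ImagesPipeline` shown as enabled in the log; the rest is taken directly from the log):

```python
# settings.py -- reconstructed from the log's "Overridden settings" line.
BOT_NAME = 'xkcd'

SPIDER_MODULES = ['xkcd.spiders']
NEWSPIDER_MODULE = 'xkcd.spiders'

# One second between requests, as shown in the log.
DOWNLOAD_DELAY = 1

# The ImagesPipeline (enabled per the log) needs a storage directory.
# The pipeline path matches Scrapy 0.24's scrapy.contrib layout.
ITEM_PIPELINES = {'scrapy.contrib.pipeline.images.ImagesPipeline': 1}
IMAGES_STORE = 'images'  # assumption: any writable directory works
```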

1 Answer:

Answer 0 (score: 3):

You need to set the follow parameter to True in your crawling rules. Try something like this:

linkextractor = LinkExtractor(allow=(r'\d+',), unique=True)
rules = [Rule(linkextractor, callback='parse_xkcd', follow=True)]
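
Note that the broad `\d+` pattern also matches non-comic URLs such as `http://xkcd.com/1461/large/`, which is why the log shows one empty item (`{'image_urls': [], 'images': [], 'title': []}`) near the top. A quick stdlib check illustrates this; the anchored pattern `r'/\d+/$'` is a suggested refinement, not part of the original answer:

```python
import re

# The allow pattern from the question: any URL containing a digit run.
broad = re.compile(r'\d+')
# A tighter alternative that only matches canonical comic pages.
tight = re.compile(r'/\d+/$')

urls = [
    'http://www.xkcd.com/1/',       # comic page
    'http://xkcd.com/1461/large/',  # large-image page (the empty item in the log)
]

print([bool(broad.search(u)) for u in urls])  # [True, True] -- both match
print([bool(tight.search(u)) for u in urls])  # [True, False] -- comic page only
```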