Scrapy does not download images

Time: 2014-01-03 10:40:59

Tags: web-scraping jpeg web-crawler scrapy

I am trying to download images from different URLs via Scrapy. I am new to Python and Scrapy, so maybe I am missing something obvious. This is my first post on Stack Overflow, and any help is much appreciated!

Here are my various files:

items.py

from scrapy.item import Item, Field

class ImagesTestItem(Item):
    image_urls = Field()
    image_names = Field()
    images = Field()

settings.py:

from scrapy import log

log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)

BOT_NAME = 'images_test'

SPIDER_MODULES = ['images_test.spiders']
NEWSPIDER_MODULE = 'images_test.spiders'
ITEM_PIPELINES = {'images_test.pipelines.images_test': 1}
IMAGES_STORE = '/Users/Coralie/Documents/scrapy/images_test/images'
DOWNLOAD_DELAY = 5
STATS_CLASS = True

The spider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item,Field
from scrapy.utils.response import get_base_url
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()

class images_test(CrawlSpider):
    name = "images_test"
    allowed_domains = ['veranstaltungszentrum.bbaw.de']
    start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0%d_g.jpg' % i for i in xrange(9)  ]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select()
        number = 0
        for site in sites:    
            xpath = '//img/@src'
            image_urls = hxs.select('//img/@src').extract()
            item['image_urls'] = ["http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0x_g.jpg" + x for x in image_urls]
            items.append(item)
            number = number + 1
            return item

        print item['image_urls']

pipelines.py

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from PIL import Image 
from scrapy import log

log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)
scrapy.log.ERROR

class images_test(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

The log says the following:

/Library/Python/2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
    STATS_ENABLED: no longer supported (change STATS_CLASS instead)
  warnings.warn(msg, ScrapyDeprecationWarning)
2014-01-03 11:36:48+0100 [scrapy] INFO: Scrapy 0.20.2 started (bot: images_test)
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'images_test.spiders', 'SPIDER_MODULES': ['images_test.spiders'], 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'images_test'}
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-03 11:36:49+0100 [scrapy] WARNING: This is a warning
2014-01-03 11:36:49+0100 [scrapy] ERROR: This is a error
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled item pipelines: images_test
2014-01-03 11:36:49+0100 [images_test] INFO: Spider opened
2014-01-03 11:36:49+0100 [images_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-03 11:36:49+0100 [images_test] DEBUG: Crawled (404) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib00_g.jpg> (referer: None)
2014-01-03 11:36:55+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg> (referer: None)
2014-01-03 11:36:59+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib02_g.jpg> (referer: None)
2014-01-03 11:37:05+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib03_g.jpg> (referer: None)
2014-01-03 11:37:10+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib04_g.jpg> (referer: None)
2014-01-03 11:37:16+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib05_g.jpg> (referer: None)
2014-01-03 11:37:22+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib06_g.jpg> (referer: None)
2014-01-03 11:37:29+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib07_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib08_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] INFO: Closing spider (finished)
2014-01-03 11:37:36+0100 [images_test] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2376,
     'downloader/request_count': 9,
     'downloader/request_method_count/GET': 9,
     'downloader/response_bytes': 343660,
     'downloader/response_count': 9,
     'downloader/response_status_count/200': 8,
     'downloader/response_status_count/404': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 1, 3, 10, 37, 36, 166139),
     'log_count/DEBUG': 15,
     'log_count/ERROR': 1,
     'log_count/INFO': 3,
     'log_count/WARNING': 1,
     'response_received_count': 9,
     'scheduler/dequeued': 9,
     'scheduler/dequeued/memory': 9,
     'scheduler/enqueued': 9,
     'scheduler/enqueued/memory': 9,
     'start_time': datetime.datetime(2014, 1, 3, 10, 36, 49, 37947)}
2014-01-03 11:37:36+0100 [images_test] INFO: Spider closed (finished)

Why are the images not being saved? Even my print item['image_urls'] statement never gets executed.

Thanks

2 answers:

Answer 0 (score: 2)

Consider changing your spider code to the following:

from urlparse import urljoin  # Python 2 stdlib

start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery']

def parse(self, response):
    sel = HtmlXPathSelector(response)
    item = ImagesTestItem()
    url = 'http://veranstaltungszentrum.bbaw.de'
    # assign first, then return -- "return item[...] = ..." is a syntax error
    item['image_urls'] = [urljoin(url, x)
                          for x in sel.select('//img/@src').extract()]
    return item

HtmlXPathSelector can only parse HTML documents, but it seems you are feeding it the images from your start_urls.
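
Alternatively, since your start_urls already point straight at the .jpg files, there is nothing for HtmlXPathSelector to parse: you can just record each response's own URL and let the images pipeline fetch and store it. A minimal sketch (the spider name and class name are mine; it reuses ImagesTestItem from your question, and note the pipeline will request each URL a second time):

from scrapy.spider import BaseSpider
from images_test.items import ImagesTestItem

class ImagesDirectSpider(BaseSpider):
    name = "images_direct"
    allowed_domains = ['veranstaltungszentrum.bbaw.de']
    start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0%d_g.jpg' % i
                  for i in xrange(9)]

    def parse(self, response):
        # the response body is the image itself; the pipeline only needs its URL
        item = ImagesTestItem()
        item['image_urls'] = [response.url]
        return item

Also note that your pipeline's item_completed assigns item['image_paths'], so items.py must declare an image_paths field as well, otherwise that assignment raises a KeyError.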

Answer 1 (score: 0)

You can also try it without using the images pipeline:

import urllib.request  # Python 3

def parse(self, response):
    # extract the first image URL on the page
    imageurl = response.xpath("//img/@src").get()
    imagename = imageurl.split("/")[-1]
    # fetch the image directly, bypassing Scrapy's downloader
    req = urllib.request.Request(imageurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'})
    resource = urllib.request.urlopen(req)
    with open("foldername/" + imagename, "wb") as output:
        output.write(resource.read())
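
If the page uses relative src attributes you will also want to resolve them, and the target folder has to exist before open() can write into it. A slight extension of the same idea ("foldername" is just a placeholder; response.urljoin and .getall() assume Scrapy 1.0+):

import os
import urllib.request

def parse(self, response):
    os.makedirs("foldername", exist_ok=True)  # create the target folder if missing
    for src in response.xpath("//img/@src").getall():
        imageurl = response.urljoin(src)  # resolve relative image URLs
        imagename = imageurl.split("/")[-1]
        req = urllib.request.Request(imageurl, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(req) as resource:
            with open(os.path.join("foldername", imagename), "wb") as output:
                output.write(resource.read())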