使用Scrapy检索图像时的值错误

时间:2016-04-24 23:14:07

标签: scrapy scrapy-spider

我无法使用Scrapy的图像管道来检索图像。从错误报告中,我认为我正在为Scrapy提供正确的image_urls。但是,Scrapy不是从它们下载图像,而是返回错误:ValueError:请求url中缺少方案:h。

这是我第一次使用图像管道功能,所以我怀疑我犯了一个简单的错误。同样,我很感激帮助解决它。

您将在下面找到我的蜘蛛,设置,项目和错误输出。他们不是MWE,但我认为它们非常简单,易于理解。

蜘蛛:     进口scrapy     来自scrapy.spiders导入CrawlSpider,Rule     来自scrapy.linkextractors导入LinkExtractor     来自ngamedallions.items导入NgamedallionsItem     来自scrapy.loader.processors导入TakeFirst     来自scrapy.loader导入ItemLoader     来自scrapy.loader.processors导入Join     来自scrapy.http导入请求     导入重新

class NGASpider(CrawlSpider):
    name = 'ngamedallions'
    allowed_domains = ['nga.gov']
    start_urls = [
        'http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html'
    ]

    rules = (
            Rule(LinkExtractor(allow=('art-object-page.*','objects/*')),callback='parse_CatalogRecord',
    follow=True
    ),)

    def parse_CatalogRecord(self, response):
        CatalogRecord = ItemLoader(item=NgamedallionsItem(), response=response)
        CatalogRecord.default_output_processor = TakeFirst()
        keywords = "medal|medallion"
        r = re.compile('.*(%s).*' % keywords, re.IGNORECASE|re.MULTILINE|re.UNICODE)
        if r.search(response.body_as_unicode()):
            CatalogRecord.add_xpath('title', './/dl[@class="artwork-details"]/dt[@class="title"]/text()')
            CatalogRecord.add_xpath('accession', './/dd[@class="accession"]/text()')
            CatalogRecord.add_xpath('inscription', './/div[@id="inscription"]/p/text()')
            CatalogRecord.add_xpath('image_urls', './/img[@class="mainImg"]/@src')

            return CatalogRecord.load_item()

设定:

BOT_NAME = 'ngamedallions'

SPIDER_MODULES = ['ngamedallions.spiders']
NEWSPIDER_MODULE = 'ngamedallions.spiders'

DOWNLOAD_DELAY=3

ITEM_PIPELINES = {
   'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = '/home/tricia/Documents/Programing/Scrapy/ngamedallions/medallionimages'

产品:

import scrapy

class NgamedallionsItem(scrapy.Item):
    title = scrapy.Field()
    accession = scrapy.Field()
    inscription = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    pass

错误日志:

2016-04-24 19:00:40 [scrapy] INFO: Scrapy 1.0.5.post2+ga046ce8 started (bot: ngamedallions)
2016-04-24 19:00:40 [scrapy] INFO: Optional features available: ssl, http11
2016-04-24 19:00:40 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'ngamedallions.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['ngamedallions.spiders'], 'BOT_NAME': 'ngamedallions', 'FEED_FORMAT': 'json', 'DOWNLOAD_DELAY': 3}
2016-04-24 19:00:40 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-24 19:00:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-24 19:00:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-24 19:00:40 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2016-04-24 19:00:40 [scrapy] INFO: Spider opened
2016-04-24 19:00:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-24 19:00:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-24 19:00:40 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: None)
2016-04-24 19:00:44 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: None)
2016-04-24 19:00:48 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:48 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.a',
 'image_urls': u'http://media.nga.gov/public/objects/1/3/1/2/1312-primary-0-440x400.jpg',
 'inscription': u'around circumference: IOHANNES FRANCISCVS GON MA; around bottom circumference: MANTVA',
 'title': u'Gianfrancesco Gonzaga di Rodigo, 1445-1496, Lord of Bozzolo, Sabbioneta, and Viadana 1478 [obverse]'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:48 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-04-24 19:00:51 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1313.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1312.html)
2016-04-24 19:00:52 [scrapy] ERROR: Error processing {'accession': u'1942.9.163.b',
 'image_urls': u'http://media.nga.gov/public/objects/1/3/1/3/1313-primary-0-440x400.jpg',
 'inscription': u'around top circumference: TRINACRIA IANI; upper center: PELORVS ; across center: PA LI; across bottom: BELAVRA',
 'title': u'House between Two Hills [reverse]'}
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
 File "/usr/lib/pymodules/python2.7/scrapy/pipelines/media.py", line 44, in process_item
requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/lib/pymodules/python2.7/scrapy/pipelines/images.py", line 109, in get_media_requests
return [Request(x) for x in item.get(self.IMAGES_URLS_FIELD, [])]
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 55, in _set_url
self._set_url(url.encode(self.encoding))
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
2016-04-24 19:00:55 [scrapy] DEBUG: Crawled (200) <GET http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html> (referer: http://www.nga.gov/content/ngaweb/Collection/art-object-page.1.html)
2016-04-24 19:01:02 [scrapy] INFO: Closing spider (finished)
2016-04-24 19:01:02 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1609,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 5,
 'downloader/response_bytes': 125593,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 5,
 'dupefilter/filtered': 5,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 4, 24, 23, 1, 2, 938181),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'request_depth_max': 2,
 'response_received_count': 5,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2016, 4, 24, 23, 0, 40, 851598)}
2016-04-24 19:01:02 [scrapy] INFO: Spider closed (finished)

1 个答案:

答案 0 :(得分:1)

当TakeFirst处理器应该是一个列表时,它会使image_urls成为一个字符串。

添加:

CatalogRecord.image_urls_out = lambda v: v

编辑:

这也可能是:

CatalogRecord.image_urls_out = scrapy.loader.processors.Identity()