Saving images from Scrapy integrated with Django

Time: 2018-09-24 06:42:24

Tags: django scrapy

I have a Scrapy project connected to a Django project, and everything works fine (for example, when I run the scraper I can save items to the database).

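I am trying to add an image scraper to my project, but I can't get it to work. I can get Scrapy's image scraper working on its own, but not when it is connected to the Django project.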

The error I get is as follows:

  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/pipelines/media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/pipelines/images.py", line 155, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/pipelines/images.py", line 155, in <listcomp>
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/http/request/__init__.py", line 62, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h

Here is my project:

Models.py

from django.db import models

class atl_sale_listing(models.Model):
    metro_area = models.CharField(max_length=40, null=False, blank=False)
    listing_id = models.CharField(max_length=250, null=False, blank=False, unique=True)  #must be unique
    url = models.CharField(max_length=450, null=True, blank=True)
    status = models.CharField(max_length=25, null=True, blank=True)
    price = models.IntegerField(null=True, blank=True)
    tax = models.FloatField(null=True, blank=True)

Items.py - Note: I added the image_urls and images fields to the Django item object

import scrapy
from scrapy_djangoitem import DjangoItem
from realestate_app.models import atl_sale_listing

class AtlSaleListingItem(DjangoItem):
    django_model = atl_sale_listing

    image_urls = scrapy.Field() #added
    images =  scrapy.Field() #added

spider.py

import scrapy
from re_scraper.items import AtlSaleListingItem

from scrapy.loader import ItemLoader


class AtlListings2Spider(scrapy.Spider):
    name = "atl_buy_testing"
    allowed_domains = ["www.something.com"]
    start_urls = ['www.something.com/something2',
                  ] #specify the filter in the url

    def parse(self, response):
        listings = response.xpath('//div[@class="cardone "]')
        order = 1

        for listing in listings:
            url = listing.xpath('.//a/@href').extract_first()
            yield scrapy.Request(url,
                            callback=self.parse_listing)


    def parse_listing(self, response):

        status = response.xpath('//*[@class="text-orange"]/text()').extract_first()
        price = response.xpath('//*[@class="price"]/text()').extract_first()

        image_urls = response.xpath('//img/@data-img')[0].extract() #added image field  here

        yield AtlSaleListingItem(
            status = status,
            price = price,

            image_urls = image_urls,
            )

settings.py

from random import random
import os
import sys


DJANGO_PROJECT_PATH = os.path.dirname(os.path.abspath(__file__))
DJANGO_SETTINGS_MODULE = 'realestate.settings'

sys.path.append(os.path.dirname(os.path.abspath('.')))
os.environ['DJANGO_SETTINGS_MODULE'] = 'realestate.settings'

import django
django.setup()

BOT_NAME = 're_scraper'

SPIDER_MODULES = ['re_scraper.spiders']
NEWSPIDER_MODULE = 're_scraper.spiders'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
   're_scraper.pipelines.AtlListingPipeline': 5,
   'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = '/Users/user1/desktop/movoto_images'

Any help here would be greatly appreciated.

1 answer:

Answer 0 (score: 0):

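The error ValueError: Missing scheme in request url: h comes from a request made in the media pipeline. The URL being requested has no scheme (http(s), ftp, file, data, irc, and so on). The lone "h" in the message is the telltale: image_urls holds a single string rather than a list, so the pipeline iterates over it character by character and tries to request the first character of the URL.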
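
To make the cause concrete, here is a minimal sketch that uses only the standard library, with no Scrapy involved. The URL is a hypothetical stand-in for whatever the page's data-img attribute contains. The list comprehension mirrors the one in the traceback, which builds one request per element of the field.

```python
from urllib.parse import urlparse

# Hypothetical value standing in for what [0].extract() returns: one string.
image_urls = "https://www.something.com/images/1.jpg"

# The pipeline builds one request per element of the field, so a plain
# string is consumed one character at a time.
seen_by_pipeline = [x for x in image_urls]

# A full URL has a scheme; a single character does not.
first_has_scheme = bool(urlparse(seen_by_pipeline[0]).scheme)
whole_has_scheme = bool(urlparse(image_urls).scheme)

print(seen_by_pipeline[0], first_has_scheme, whole_has_scheme)
```

The first element the pipeline sees is the single character "h", which is exactly what the error message reports.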

Check the value you get from image_urls = response.xpath('//img/@data-img')[0].extract(). It must be a list of URLs, and each URL needs its scheme.
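
One possible way to guarantee that shape is a small helper that wraps a bare string in a list and resolves relative paths against the page URL. This is a sketch under assumptions, not part of Scrapy: the name normalize_image_urls and the example URLs are hypothetical, and it assumes only http and https images are wanted.

```python
from urllib.parse import urljoin, urlparse


def normalize_image_urls(raw, base_url):
    """Return a list of absolute http(s) URLs from a string or a list of strings."""
    # Wrap a bare string so it is not iterated character by character.
    candidates = [raw] if isinstance(raw, str) else list(raw or [])
    # Resolve relative and protocol-relative paths against the page URL.
    absolute = [urljoin(base_url, u.strip()) for u in candidates if u and u.strip()]
    # Keep only URLs that ended up with an http(s) scheme.
    return [u for u in absolute if urlparse(u).scheme in ("http", "https")]


page = "https://www.something.com/listing/1"  # hypothetical page URL
from_string = normalize_image_urls("https://www.something.com/img/1.jpg", page)
from_relative = normalize_image_urls(["/img/2.jpg", "//cdn.something.com/3.jpg"], page)
from_nothing = normalize_image_urls(None, page)
```

In the spider, this could be called as normalize_image_urls(response.xpath('//img/@data-img').extract(), response.url). Without the [0] index, extract() already returns a list of every match.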