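I have a Scrapy project connected to a Django project, and everything works fine (for example, when I run the spider, I can save items to the database).
I am trying to add image scraping to my project, but I can't get it working. I can get Scrapy's images pipeline to work on its own, but not when it is connected to the Django project.
The error I get is as follows: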
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/pipelines/media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/pipelines/images.py", line 155, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/pipelines/images.py", line 155, in <listcomp>
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/Users/junaid/Desktop/clscraper2/lib/python3.6/site-packages/scrapy/http/request/__init__.py", line 62, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
Here is my project:
models.py
from django.db import models

class atl_sale_listing(models.Model):
    metro_area = models.CharField(max_length=40, null=False, blank=False)
    listing_id = models.CharField(max_length=250, null=False, blank=False, unique=True)  # must be unique
    url = models.CharField(max_length=450, null=True, blank=True)
    status = models.CharField(max_length=25, null=True, blank=True)
    price = models.IntegerField(null=True, blank=True)
    tax = models.FloatField(null=True, blank=True)
items.py (note: I added the image_urls and images fields to the DjangoItem)
import scrapy
from scrapy_djangoitem import DjangoItem
from realestate_app.models import atl_sale_listing

class AtlSaleListingItem(DjangoItem):
    django_model = atl_sale_listing
    image_urls = scrapy.Field()  # added
    images = scrapy.Field()  # added
spider.py
import scrapy
from re_scraper.items import AtlSaleListingItem
from scrapy.loader import ItemLoader

class AtlListings2Spider(scrapy.Spider):
    name = "atl_buy_testing"
    allowed_domains = ["www.something.com"]
    start_urls = ['www.something.com/something2',
                  ]  # specify the filter in the url

    def parse(self, response):
        listings = response.xpath('//div[@class="cardone "]')
        order = 1
        for listing in listings:
            url = listing.xpath('.//a/@href').extract_first()
            yield scrapy.Request(url,
                                 callback=self.parse_listing)

    def parse_listing(self, response):
        status = response.xpath('//*[@class="text-orange"]/text()').extract_first()
        price = response.xpath('//*[@class="price"]/text()').extract_first()
        image_urls = response.xpath('//img/@data-img')[0].extract()  # added image field here
        yield AtlSaleListingItem(
            status = status,
            price = price,
            image_urls = image_urls,
        )
settings.py
from random import random
import os
import sys
DJANGO_PROJECT_PATH = os.path.dirname(os.path.abspath(__file__))
DJANGO_SETTINGS_MODULE = 'realestate.settings'
sys.path.append(os.path.dirname(os.path.abspath('.')))
os.environ['DJANGO_SETTINGS_MODULE'] = 'realestate.settings'
import django
django.setup()
BOT_NAME = 're_scraper'
SPIDER_MODULES = ['re_scraper.spiders']
NEWSPIDER_MODULE = 're_scraper.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    're_scraper.pipelines.AtlListingPipeline': 5,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/Users/user1/desktop/movoto_images'
Any help here would be greatly appreciated.
Answer 0 (score: 0)
The error ValueError: Missing scheme in request url: h is caused by a request made in the media pipeline. The URL being requested has no scheme, such as http(s), ftp, file, data, irc, and so on.
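As a rough illustration of what "missing scheme" means, the sketch below checks a few strings with the standard library. It is a stand-in for the idea, not Scrapy's own validation code, and `has_scheme` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def has_scheme(url):
    """Return True if the URL starts with a scheme such as http or https."""
    return bool(urlparse(url).scheme)


# A full URL carries a scheme; a bare host/path or a single character does not.
for candidate in ("https://www.something.com/a.jpg",
                  "www.something.com/a.jpg",
                  "h"):
    print(candidate, has_scheme(candidate))
```

Running a check like this on the values that end up in image_urls should show quickly whether any of them would be rejected.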
Check the URL you get from image_urls = response.xpath('//img/@data-img')[0].extract(). It must be a list of URLs (and don't forget the scheme!).
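The lone "h" at the end of the error message is consistent with image_urls being a single string: the pipeline loops over the field, and looping over a string yields one character at a time, so the first "URL" it sees is "h". The sketch below shows that effect and one possible way to normalize the value into a list of absolute URLs. `normalize_image_urls` and the example URLs are assumptions made for illustration, not part of the original project.

```python
from urllib.parse import urljoin


def normalize_image_urls(raw, base_url):
    """Hypothetical helper: always return a list of absolute URLs."""
    if raw is None:
        return []
    if isinstance(raw, str):
        # A single URL string must be wrapped, otherwise it is iterated
        # character by character.
        raw = [raw]
    # urljoin resolves relative and scheme-relative URLs against the page URL.
    return [urljoin(base_url, url) for url in raw]


single = "https://www.something.com/img/1.jpg"
first_items = [x for x in single][:1]
print(first_items)  # ['h'], which is what the images pipeline tried to request

page_url = "https://www.something.com/listing/1"
print(normalize_image_urls(single, page_url))
print(normalize_image_urls(["/img/2.jpg", "//cdn.something.com/3.jpg"], page_url))
```

Inside the spider, the equivalent would likely be something along the lines of image_urls = [response.urljoin(u) for u in response.xpath('//img/@data-img').extract()], which keeps every matching image and resolves each one against the page URL.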