Adding pagination, KeyError: 'url'

Date: 2018-09-11 19:42:46

Tags: python web-scraping scrapy

I'm working on my first spider. It has to enter a specific category, open each listing, and extract the data I need, then move through the pages of that category, but I'm getting the following error and can't solve it.

The error started when I added the pagination.

Thanks for your help.

item.py

import scrapy

class ReporteinmobiliarioItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    titulo = scrapy.Field()
    precio = scrapy.Field()

    pass

spider.py

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from scrapy.spider import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider
from reporteInmobiliario.items import ReporteinmobiliarioItem
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join
from w3lib.html import remove_tags
from scrapy import Request


class reporteInmobiliario(CrawlSpider):
    name = 'zonaprop'
    start_urls = ['https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html']

    def parse(self,response):
        for folow_url in response.css("h4.aviso-data-title>a::attr(href)").extract():
            url = response.urljoin(folow_url)
            yield Request(url,callback = self.populate_item)

    def populate_item(self,response):
        item_loader = ItemLoader(item=ReporteinmobiliarioItem(),response=response)
        item_loader.default_input_procesor = MapCompose(remove_tags)
        item_loader.add_css('titulo', 'div.card-title>h1::text')
        item_loader.add_css('precio', 'strong.venta::text')

        item_loader.add_value('url',response.url)
        yield item_loader.load.item()

    def pagination(self,response):
        next_page = response.css('h4.aviso-data-title>a::attr(href)').extract_first()
        if next_page in None:
            next_page = response.urljoin(next_page)
            return Request(next_page,callback=self.parse)

Log:

    2018-09-11 10:28:56 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: reporteInmobiliario)
    2018-09-11 10:28:56 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.2.2, Platform Windows-2012ServerR2-6.3.9600-SP0
    2018-09-11 10:28:56 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'reporteInmobiliario', 'FEED_EXPORT_ENCODING': 'utf-8', 'NEWSPIDER_MODULE': 'reporteInmobiliario.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['reporteInmobiliario.spiders']}
    2018-09-11 10:28:57 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2018-09-11 10:28:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-09-11 10:28:57 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-09-11 10:28:57 [scrapy.middleware] INFO: Enabled item pipelines:
    ['reporteInmobiliario.pipelines.JsonWriterPipeline']
    2018-09-11 10:28:57 [scrapy.core.engine] INFO: Spider opened
    2018-09-11 10:28:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2018-09-11 10:28:57 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
    2018-09-11 10:28:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zonaprop.com.ar/robots.txt> (referer: None)
    2018-09-11 10:28:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html> (referer: None)
    2018-09-11 10:29:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zonaprop.com.ar/propiedades/av-juan-de-garay-2800-parque-patricios-capital-43299554.html> (referer: https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html)
    2018-09-11 10:29:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zonaprop.com.ar/propiedades/escalada-1500-mataderos-capital-federal-31974593.html> (referer: https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html)
    2018-09-11 10:29:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zonaprop.com.ar/propiedades/excelente-esquina-42930524.html> (referer: https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html)
    2018-09-11 10:29:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zonaprop.com.ar/propiedades/local-palermo-43757972.html> (referer: https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html)
    2018-09-11 10:29:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zonaprop.com.ar/propiedades/lote-en-alquiler-gascon-50-1-2-cuadra-av-rivadavia-32629293.html> (referer: https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html)
    2018-09-11 10:29:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zonaprop.com.ar/propiedades/victor-martinez-1600-parque-chacabuco-capital-20515827.html> (referer: https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html)
    2018-09-11 10:29:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zonaprop.com.ar/propiedades/lote-2000-metros-en-liniers-oportunidad!-43312489.html> (referer: https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html)
    2018-09-11 10:29:00 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.zonaprop.com.ar/propiedades/av-juan-de-garay-2800-parque-patricios-capital-43299554.html> (referer: https://www.zonaprop.com.ar/terrenos-alquiler-capital-federal.html)
    Traceback (most recent call last):
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
        yield next(it)
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
        for x in result:
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "D:\Repositorio_Local\reporteInmobiliario\reporteInmobiliario\spiders\spider.py", line 34, in populate_item
        item_loader.add_value('url',response.url)
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\loader\__init__.py", line 77, in add_value
        self._add_value(field_name, value)
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\loader\__init__.py", line 91, in _add_value
        processed_value = self._process_input_value(field_name, value)
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\loader\__init__.py", line 148, in _process_input_value
        proc = self.get_input_processor(field_name)
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\loader\__init__.py", line 137, in get_input_processor
        self.default_input_processor)
      File "c:\users\ssalvadeo\appdata\local\continuum\anaconda3\lib\site-packages\scrapy\loader\__init__.py", line 154, in _get_item_field_attr
        value = self.item.fields[field_name].get(key, default)
    KeyError: 'url'

1 answer:

Answer 0 (score: 0)

The error value = self.item.fields[field_name].get(key, default) KeyError: 'url' means you have not defined a url field in your item.

Update it as follows:

class ReporteinmobiliarioItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    titulo = scrapy.Field()
    precio = scrapy.Field()
    url = scrapy.Field()

    pass