Error: Spider error processing <GET https://www.imovirtual.com/comprar/apartamento/lisboa/> (referer: None)

Time: 2018-06-20 14:36:58

Tags: python css scrapy web-crawler

I'm trying to build a web crawler (in Python, with Scrapy) to extract information from classified ads: scrape the content on the listing page, then follow each ad's subpage and extract the rest of the information. When I run the code I get the error above. Any suggestions?

import scrapy

class SapoSpider(scrapy.Spider):
    name = "imo"
    start_urls = ['https://www.imovirtual.com/comprar/apartamento/lisboa/']

    def parse(self, response):
        for Property in response.css('div.offer-item-details'):
            youritem = {
                'preco': Property.css('span.offer-item title::text').extract_first(),
                'autor': Property.css('li.offer-item-price::text').extract(),
                'data': Property.css('li.offer-item-area::text').extract(),
                'data_2': Property.css('li.offer-item-price-perm::text').extract()
            }
            yield scrapy.Request(subpage_link, callback=self.parse_subpage)

            # next_page = response.css('li.pager-next a::attr(href)').extract_first()
            # if next_page is not None:
            #     next_page = response.urljoin(next_page)
            #     yield scrapy.Request(next_page, callback=self.parse)

    def parse_subpage(self, youritem):
        for i in response.css('header[class=offer-item-header] a::attr(href)'):
            youritem = {
                'info': i.css('ul.main-list::text').extract(),
            }
            yield youritem

1 Answer:

Answer 0 (score: 0)

A few changes are needed to make this run:

You have to set subpage_link (it doesn't appear to be defined anywhere).

A Request callback takes exactly one argument (the Scrapy response), so you should replace parse_subpage(self, youritem) with parse_subpage(self, response).

To send the item along with a Request, the best approach is to use the Request's meta parameter, which lets you pass data from one Scrapy response to the next. If you replace scrapy.Request(subpage_link, callback=self.parse_subpage) with scrapy.Request(subpage_link, callback=self.parse_subpage, meta={'item': youritem}), then when Scrapy calls parse_subpage you will be able to access youritem via response.meta.get('item').
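As an aside on the undefined subpage_link: ad links on a listing page are usually relative, so they need to be joined against the page URL before being requested. Scrapy's response.urljoin(href) does exactly this, and behaves like the standard library's urljoin (the ad path below is made up for illustration):

```python
from urllib.parse import urljoin

base = 'https://www.imovirtual.com/comprar/apartamento/lisboa/'
# A hypothetical relative href as it might appear in the listing HTML:
href = '/anuncio/apartamento-t2-ID123.html'

# response.urljoin(href) in Scrapy is equivalent to:
print(urljoin(base, href))
# → https://www.imovirtual.com/anuncio/apartamento-t2-ID123.html
```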

This should work:

def parse(self, response):
    for Property in response.css('div.offer-item-details'):
        youritem = {
            'preco': Property.css('span.offer-item title::text').extract_first(),
            'autor': Property.css('li.offer-item-price::text').extract(),
            'data': Property.css('li.offer-item-area::text').extract(),
            'data_2': Property.css('li.offer-item-price-perm::text').extract()
        }
        subpage_link = ......
        yield scrapy.Request(subpage_link, callback=self.parse_subpage,
                             meta={'item': youritem})

        # next_page = response.css('li.pager-next a::attr(href)').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)

def parse_subpage(self, response):
    for i in response.css('header[class=offer-item-header] a::attr(href)'):
        youritem = response.meta.get('item')
        youritem['info'] = i.css('ul.main-list::text').extract()
        yield youritem
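To see why the meta hand-off works, here is a minimal sketch that simulates the pattern with plain dicts standing in for real Request/Response objects (all names and values here are illustrative, not the Scrapy API — in real Scrapy the framework does the downloading and callback dispatch for you):

```python
def parse_subpage(response):
    # Scrapy copies request.meta onto the response passed to the callback,
    # so the partially-filled item is recovered here and completed.
    youritem = response['meta'].get('item')
    youritem['info'] = ['details scraped from the subpage']
    return youritem

# parse() would yield a Request carrying the partially-filled item:
request = {'url': 'https://example.com/ad/1',
           'callback': parse_subpage,
           'meta': {'item': {'preco': '100 000'}}}

# The downloader fetches the URL and invokes the callback with a
# response that carries the same meta dict:
response = {'url': request['url'], 'meta': request['meta']}
item = request['callback'](response)
print(item)
# → {'preco': '100 000', 'info': ['details scraped from the subpage']}
```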