I am trying to build a web crawler (in Python, using Scrapy) to extract information from ads: it should scrape the content on the listing page, then follow each ad's subpage and extract the remaining information. When I run the code I get an error. Any suggestions?
import scrapy

class SapoSpider(scrapy.Spider):
    name = "imo"
    start_urls = ['https://www.imovirtual.com/comprar/apartamento/lisboa/']

    def parse(self, response):
        for Property in response.css('div.offer-item-details'):
            youritem = {
                'preco': Property.css('span.offer-item title::text').extract_first(),
                'autor': Property.css('li.offer-item-price::text').extract(),
                'data': Property.css('li.offer-item-area::text').extract(),
                'data_2': Property.css('li.offer-item-price-perm::text').extract()
            }
            yield scrapy.Request(subpage_link, callback=self.parse_subpage)

        # next_page = response.css('li.pager-next a::attr(href)').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)

    def parse_subpage(self, youritem):
        for i in response.css('header[class=offer-item-header] a::attr(href)'):
            youritem = {
                'info': i.css('ul.main-list::text').extract(),
            }
            yield youritem
Answer 0 (score: 0)
A few changes are needed to get this running:
You have to set subpage_link (it does not appear to be defined anywhere).
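For example (just a sketch: it reuses the header selector from your own parse_subpage and assumes that anchor actually holds the ad's detail URL):

    # hypothetical extraction of the subpage URL from the offer header
    href = Property.css('header[class=offer-item-header] a::attr(href)').extract_first()
    if href:
        subpage_link = response.urljoin(href)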
Request callbacks take a single argument (a Scrapy response), so you should replace parse_subpage(self, youritem) with parse_subpage(self, response).
To send an item along with a Request, it is best to use the Request's meta parameter, which lets you pass data from one Scrapy response to the next. If you replace scrapy.Request(subpage_link, callback=self.parse_subpage) with scrapy.Request(subpage_link, callback=self.parse_subpage, meta={'item': youritem}), then when Scrapy calls parse_subpage you will be able to access youritem via response.meta.get('item').
This should work:
def parse(self, response):
    for Property in response.css('div.offer-item-details'):
        youritem = {
            'preco': Property.css('span.offer-item title::text').extract_first(),
            'autor': Property.css('li.offer-item-price::text').extract(),
            'data': Property.css('li.offer-item-area::text').extract(),
            'data_2': Property.css('li.offer-item-price-perm::text').extract()
        }
        subpage_link = ......  # set this to the ad's detail-page URL
        yield scrapy.Request(subpage_link, callback=self.parse_subpage,
                             meta={'item': youritem})

    # next_page = response.css('li.pager-next a::attr(href)').extract_first()
    # if next_page is not None:
    #     next_page = response.urljoin(next_page)
    #     yield scrapy.Request(next_page, callback=self.parse)

def parse_subpage(self, response):
    for i in response.css('header[class=offer-item-header] a::attr(href)'):
        # retrieve the item built in parse() from the request's meta
        youritem = response.meta.get('item')
        youritem['info'] = i.css('ul.main-list::text').extract()
        yield youritem
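Assuming the spider is saved in a file such as imo_spider.py (the file name here is just an example), you can run it and dump the scraped items to JSON with:

    scrapy runspider imo_spider.py -o ads.json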