I am trying to build a web crawler (in Python, using Scrapy) to extract information from ads: it should scrape the content on the listing page, then follow each ad's subpage and extract the remaining information. When I run the code I get an error. Any suggestions?
import scrapy

class SapoSpider(scrapy.Spider):
    name = "imo"
    start_urls = ['https://www.imovirtual.com/comprar/apartamento/lisboa/']

    def parse(self, response):
        for Property in response.css('div.offer-item-details'):
            youritem = {
                'preco': Property.css('span.offer-item title::text').extract_first(),
                'autor': Property.css('li.offer-item-price::text').extract(),
                'data': Property.css('li.offer-item-area::text').extract(),
                'data_2': Property.css('li.offer-item-price-perm::text').extract()
            }
            yield scrapy.Request(subpage_link, callback=self.parse_subpage)

        # next_page = response.css('li.pager-next a::attr(href)').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)

    def parse_subpage(self, youritem):
        for i in response.css('header[class=offer-item-header] a::attr(href)'):
            youritem = {
                'info': i.css('ul.main-list::text').extract(),
            }
            yield youritem
Answer 0 (score: 0)
A few changes are needed to get this running:
You have to set subpage_link (it does not appear to be defined anywhere).
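For example (just a sketch: it reuses the header selector from your own parse_subpage and assumes that anchor actually holds the ad's detail URL):

    # hypothetical extraction of the subpage URL from the offer header
    href = Property.css('header[class=offer-item-header] a::attr(href)').extract_first()
    if href:
        subpage_link = response.urljoin(href)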
Request callbacks take a single argument (a Scrapy response), so you should replace parse_subpage(self, youritem) with parse_subpage(self, response).
To send an item along with a Request, it is best to use the Request's meta parameter, which lets you pass data from one Scrapy response to the next. If you replace scrapy.Request(subpage_link, callback=self.parse_subpage) with scrapy.Request(subpage_link, callback=self.parse_subpage, meta={'item': youritem}), then when Scrapy calls parse_subpage you will be able to access youritem via response.meta.get('item').
This should work:
def parse(self, response):
    for Property in response.css('div.offer-item-details'):
        youritem = {
            'preco': Property.css('span.offer-item title::text').extract_first(),
            'autor': Property.css('li.offer-item-price::text').extract(),
            'data': Property.css('li.offer-item-area::text').extract(),
            'data_2': Property.css('li.offer-item-price-perm::text').extract()
        }
        subpage_link = ......  # set this to the ad's detail-page URL
        yield scrapy.Request(subpage_link, callback=self.parse_subpage,
                             meta={'item': youritem})

    # next_page = response.css('li.pager-next a::attr(href)').extract_first()
    # if next_page is not None:
    #     next_page = response.urljoin(next_page)
    #     yield scrapy.Request(next_page, callback=self.parse)

def parse_subpage(self, response):
    for i in response.css('header[class=offer-item-header] a::attr(href)'):
        # retrieve the item built in parse() from the request's meta
        youritem = response.meta.get('item')
        youritem['info'] = i.css('ul.main-list::text').extract()
        yield youritem
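Assuming the spider is saved in a file such as imo_spider.py (the file name here is just an example), you can run it and dump the scraped items to JSON with:

    scrapy runspider imo_spider.py -o ads.json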