I'm developing a web scraper in Python using Scrapy. The site I'm crawling has a main page listing property ads, plus sub-pages that hold the details I need. How can I make the spider go to the main page, collect its information, follow each ad's sub-page to collect more information, and then return to the main page for the next ad? The code I have so far is below. What else do I need to implement? Thanks.
Code:
import scrapy


class SapoSpider(scrapy.Spider):
    name = "imo"
    start_urls = [
        'https://www.imovirtual.com/comprar/apartamento/lisboa/'
    ]

    def parse(self, response):
        for property_item in response.css('div.offer-item-details'):
            yield {
                'preco': property_item.css('span.offer-item-title::text').extract_first(),
                'author': property_item.css('li.offer-item-price::text').extract(),
                'data': property_item.css('li.offer-item-area::text').extract(),
                # renamed from a second 'data' key, which silently overwrote the first
                'preco_m2': property_item.css('li.offer-item-price-per-m::text').extract(),
            }

        next_page = response.css('li.pager-next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Answer 0 (score: 0)
You can do something like this (pseudocode):
    def parse(self, response):
        for property_item in response.css('div.offer-item-details'):
            youritem = {
                'preco': property_item.css('span.offer-item-title::text').extract_first(),
                'author': property_item.css('li.offer-item-price::text').extract(),
                'data': property_item.css('li.offer-item-area::text').extract(),
                'preco_m2': property_item.css('li.offer-item-price-per-m::text').extract(),
            }
            # extract the sub-page link (adjust the selector to the actual markup)
            subpage_link = response.urljoin(
                property_item.css('a::attr(href)').extract_first())
            # load the sub-page, carrying the partially filled item in request.meta
            yield scrapy.Request(subpage_link, callback=self.parse_subpage,
                                 meta={'item': youritem})

        next_page = response.css('li.pager-next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_subpage(self, response):
        # Scrapy always passes a response here, not the item itself;
        # recover the item started on the listing page from meta
        youritem = response.meta['item']
        # here parse your sub-page and complete your item information
        yield youritem
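The essential detail above is how the partially filled item travels from the listing callback to the sub-page callback: Scrapy carries it in the request's `meta` dict and exposes it again on the response. The flow can be sketched without Scrapy installed; the `FakeRequest` and `FakeResponse` classes below are invented stand-ins for `scrapy.Request` and the response object, and the field values are made up for illustration:

```python
# Minimal stand-ins for scrapy.Request / response, for illustration only.
class FakeRequest:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    def __init__(self, request):
        self.url = request.url
        # Scrapy makes the request's meta available on the response
        self.meta = request.meta


def parse_listing():
    # listing page: start an item, then hand it to the sub-page callback via meta
    item = {'preco': '250 000', 'author': None}
    yield FakeRequest('https://example.com/ad/1', callback=parse_subpage,
                      meta={'item': item})


def parse_subpage(response):
    # sub-page: recover the partial item and complete it
    item = response.meta['item']
    item['author'] = 'Some Agency'
    yield item


# drive the two callbacks the way Scrapy's engine would
request = next(parse_listing())
item = next(request.callback(FakeResponse(request)))
print(item)  # the final item carries fields from both pages
```

In a real spider you would simply replace `FakeRequest` with `scrapy.Request`; since Scrapy 1.7 the `cb_kwargs` argument is the preferred way to pass data to a callback, but `meta` works the same way in all versions.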