How do I collect information from a site across different links?

Asked: 2018-06-19 16:35:28

Tags: python scrapy web-crawler scrapy-spider

I am building a web scraper in Python using Scrapy. The site I need to crawl has a main page that lists ads, plus sub-pages that hold the details I want. How do I do this? So far I have the code posted below. What else do I need so that the spider visits the main page, collects information there, follows each ad's sub-page to collect more, and then returns to the main page to process the next ad? Thanks.

Code:

import scrapy


class SapoSpider(scrapy.Spider):
    name = "imo"
    start_urls = [
        'https://www.imovirtual.com/comprar/apartamento/lisboa/'
    ]

    def parse(self, response):
        for Property in response.css('div.offer-item-details'):
            yield {
                'preco': Property.css('span.offer-item-title::text').extract_first(),
                'author': Property.css('li.offer-item-price::text').extract(),
                'area': Property.css('li.offer-item-area::text').extract(),
                'preco_m2': Property.css('li.offer-item-price-per-m::text').extract(),
            }

        next_page = response.css('li.pager-next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
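As an aside, `response.urljoin` in the pagination step resolves a relative next-page link against the current page URL, the same way the standard library's `urllib.parse.urljoin` does. A quick sketch (the `?page=2` link is an invented example, not taken from the site):

```python
from urllib.parse import urljoin

base = "https://www.imovirtual.com/comprar/apartamento/lisboa/"
# a relative next-page link, such as the href extracted from li.pager-next
next_page = "?page=2"
print(urljoin(base, next_page))
# → https://www.imovirtual.com/comprar/apartamento/lisboa/?page=2
```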

1 answer:

Answer 0 (score: 0)

You can do something like this (pseudocode):

def parse(self, response):
    for Property in response.css('div.offer-item-details'):
        youritem = {
            'preco': Property.css('span.offer-item-title::text').extract_first(),
            'author': Property.css('li.offer-item-price::text').extract(),
            'area': Property.css('li.offer-item-area::text').extract(),
            'preco_m2': Property.css('li.offer-item-price-per-m::text').extract(),
        }

        # extract the sub-page link and follow it, passing the partially
        # filled item along in the request meta
        # (the 'a::attr(href)' selector is a placeholder; adjust it to
        # the site's actual markup)
        subpage_link = Property.css('a::attr(href)').extract_first()
        if subpage_link is not None:
            yield scrapy.Request(response.urljoin(subpage_link),
                                 callback=self.parse_subpage,
                                 meta={'item': youritem})

    next_page = response.css('li.pager-next a::attr(href)').extract_first()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

def parse_subpage(self, response):
    # a callback always receives a Response, not the item itself;
    # recover the item started in parse() from the request meta
    youritem = response.meta['item']
    # ... complete youritem with fields scraped from the sub-page ...
    yield youritem
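The core idea, carrying a partially built item through a chain of callbacks and completing it on the sub-page, can be sketched without Scrapy using plain functions. All names and page data below are invented for illustration:

```python
# stand-ins for a listing page and its detail pages (invented data)
fake_listing = [
    {"preco": "250000", "detail_url": "/anuncio/1"},
    {"preco": "310000", "detail_url": "/anuncio/2"},
]
fake_details = {
    "/anuncio/1": {"area": "80 m2"},
    "/anuncio/2": {"area": "95 m2"},
}

def parse(listing):
    """Yield (url, partial_item) pairs, like scrapy.Request(..., meta={'item': ...})."""
    for row in listing:
        item = {"preco": row["preco"]}   # fields scraped from the main page
        yield row["detail_url"], item    # follow the sub-page next

def parse_subpage(url, item):
    """Complete the item with sub-page fields, like reading response.meta['item']."""
    item.update(fake_details[url])       # fields scraped from the sub-page
    return item

items = [parse_subpage(url, item) for url, item in parse(fake_listing)]
# → [{'preco': '250000', 'area': '80 m2'}, {'preco': '310000', 'area': '95 m2'}]
```

In real Scrapy the scheduler drives this chain asynchronously; the `meta` dict (or `cb_kwargs` in newer Scrapy versions) is what ties the two callbacks together.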