How do I collect information from a site across different links?

Asked: 2018-06-19 16:35:28

Tags: python scrapy web-crawler scrapy-spider

I am building a web scraper in Python using Scrapy. The site I need to crawl has a main page that lists ads, plus sub-pages that hold the details I want. How do I do this? So far I have the code posted below. What else do I need so that the spider visits the main page, collects information there, follows each ad's sub-page to collect more, and then returns to the main page to process the next ad? Thanks.

Code:

import scrapy


class SapoSpider(scrapy.Spider):
    name = "imo"
    start_urls = [
        'https://www.imovirtual.com/comprar/apartamento/lisboa/'
    ]

    def parse(self, response):
        for Property in response.css('div.offer-item-details'):
            yield {
                'preco': Property.css('span.offer-item-title::text').extract_first(),
                'author': Property.css('li.offer-item-price::text').extract(),
                'area': Property.css('li.offer-item-area::text').extract(),
                'preco_m2': Property.css('li.offer-item-price-per-m::text').extract(),
            }

        next_page = response.css('li.pager-next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
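As an aside, `response.urljoin` in the pagination step resolves a relative next-page link against the current page URL, the same way the standard library's `urllib.parse.urljoin` does. A quick sketch (the `?page=2` link is an invented example, not taken from the site):

```python
from urllib.parse import urljoin

base = "https://www.imovirtual.com/comprar/apartamento/lisboa/"
# a relative next-page link, such as the href extracted from li.pager-next
next_page = "?page=2"
print(urljoin(base, next_page))
# → https://www.imovirtual.com/comprar/apartamento/lisboa/?page=2
```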

1 answer:

Answer 0 (score: 0)

You can do something like this (pseudocode):

def parse(self, response):
    for Property in response.css('div.offer-item-details'):
        youritem = {
            'preco': Property.css('span.offer-item-title::text').extract_first(),
            'author': Property.css('li.offer-item-price::text').extract(),
            'area': Property.css('li.offer-item-area::text').extract(),
            'preco_m2': Property.css('li.offer-item-price-per-m::text').extract(),
        }

        # extract the sub-page link and follow it, passing the partially
        # filled item along in the request meta
        # (the 'a::attr(href)' selector is a placeholder; adjust it to
        # the site's actual markup)
        subpage_link = Property.css('a::attr(href)').extract_first()
        if subpage_link is not None:
            yield scrapy.Request(response.urljoin(subpage_link),
                                 callback=self.parse_subpage,
                                 meta={'item': youritem})

    next_page = response.css('li.pager-next a::attr(href)').extract_first()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

def parse_subpage(self, response):
    # a callback always receives a Response, not the item itself;
    # recover the item started in parse() from the request meta
    youritem = response.meta['item']
    # ... complete youritem with fields scraped from the sub-page ...
    yield youritem
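The core idea, carrying a partially built item through a chain of callbacks and completing it on the sub-page, can be sketched without Scrapy using plain functions. All names and page data below are invented for illustration:

```python
# stand-ins for a listing page and its detail pages (invented data)
fake_listing = [
    {"preco": "250000", "detail_url": "/anuncio/1"},
    {"preco": "310000", "detail_url": "/anuncio/2"},
]
fake_details = {
    "/anuncio/1": {"area": "80 m2"},
    "/anuncio/2": {"area": "95 m2"},
}

def parse(listing):
    """Yield (url, partial_item) pairs, like scrapy.Request(..., meta={'item': ...})."""
    for row in listing:
        item = {"preco": row["preco"]}   # fields scraped from the main page
        yield row["detail_url"], item    # follow the sub-page next

def parse_subpage(url, item):
    """Complete the item with sub-page fields, like reading response.meta['item']."""
    item.update(fake_details[url])       # fields scraped from the sub-page
    return item

items = [parse_subpage(url, item) for url, item in parse(fake_listing)]
# → [{'preco': '250000', 'area': '80 m2'}, {'preco': '310000', 'area': '95 m2'}]
```

In real Scrapy the scheduler drives this chain asynchronously; the `meta` dict (or `cb_kwargs` in newer Scrapy versions) is what ties the two callbacks together.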