How to use Scrapy to scrape data from a listing's main listing page as well as its detail page

Time: 2019-05-02 06:55:18

Tags: python-3.x scrapy web-crawler

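I am crawling a website with property listings, and the "Buy/Rent" status appears only on the listing page. I extract the other data from the detail pages by passing each URL from the parse method to the parse_property method, but I can't get the offering type from the main listing page.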

I tried to handle it the same way I handle the individual URLs (see the commented-out code).


    def parse(self, response):
        properties = response.xpath('//div[@class="property-information-address"]/a')
        for property in properties:
            url = property.xpath('./@href').extract_first()
            yield Request(url, callback=self.parse_property, meta={'URL': url})

        # TODO: offering
        # offering=response.xpath('//div[@class="property-status"]')
        #     for of in offerings:
        #         offering=of.xpath('./a/text()').extract_first()
        #         yield Request(offering, callback=self.parse_property, meta={'Offering':offering})

        next_page = response.xpath('//div[@class="pagination"]/a/@href')[-2].extract()
        yield Request(next_page, callback=self.parse)

    def parse_property(self, response):
        l = ItemLoader(item=NPMItem(), response=response)
        url = response.meta.get('URL')
        #offer=response.meta.get('Offering')
        l.add_value('URL', response.url)
        #l.add_value('Offering', response.offer)

1 Answer:

Answer 0 (score: 1)

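You can try anchoring on an element higher up in the DOM tree and scraping both the property type and the link from there. Check this code sample; it works: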

    def parse(self, response):
        # One node per listing, so the link and the type stay paired.
        properties = response.xpath('//div[@class="property-listing"]')
        for property in properties:
            url = property.xpath('.//div[@class="property-information-address"]/a/@href').get()
            ptype = property.xpath('.//div[@class="property-status"]/a/text()').get()
            yield response.follow(url, self.parse_property, meta={'ptype': ptype})

        next_page = response.xpath('//link[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_property(self, response):
        print('======')
        print(response.meta['ptype'])
        print('======')
        # build your item here, printing is only to show content of `ptype`
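
The pairing idea can be tried offline without running a crawl. The following is only a minimal sketch: it uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, and the sample markup, the `SAMPLE_HTML` name, and the `pair_links_with_types` helper are all made up for illustration (only the class names come from the XPaths above). The real site's markup may differ and may not be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; only the class names mirror the XPaths above.
SAMPLE_HTML = """
<html><body>
  <div class="property-listing">
    <div class="property-status"><a>Buy</a></div>
    <div class="property-information-address"><a href="/property/1">1 Example St</a></div>
  </div>
  <div class="property-listing">
    <div class="property-status"><a>Rent</a></div>
    <div class="property-information-address"><a href="/property/2">2 Example Ave</a></div>
  </div>
</body></html>
"""


def pair_links_with_types(markup):
    """Return (href, offering type) pairs, one per listing container (hypothetical helper)."""
    root = ET.fromstring(markup)
    pairs = []
    # Iterate over the container, then query relative to it, so each
    # link is matched with the status that belongs to the same listing.
    for listing in root.findall('.//div[@class="property-listing"]'):
        link = listing.find('.//div[@class="property-information-address"]/a')
        ptype = listing.findtext('.//div[@class="property-status"]/a')
        if link is None:
            continue  # skip containers that have no detail link
        pairs.append((link.get('href'), ptype))
    return pairs


pairs = pair_links_with_types(SAMPLE_HTML)
for href, ptype in pairs:
    print(href, ptype)
```

Querying the links and the statuses in two separate page-wide loops, as in the commented-out attempt in the question, loses this pairing, which is why the per-container loop is used here.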