Scrapy crawl with an incrementing offset only returns the first page

Time: 2020-04-29 13:58:49

Tags: python python-3.x web-scraping scrapy

I am trying to crawl through consecutive pages whose URL suffix increases in steps of 20 (the number of listings on each page).

The first page is: https://www.daft.ie/dublin-city/property-for-sale/dublin-4/

The second is: https://www.daft.ie/dublin-city/property-for-sale/dublin-4/?offset=20

Page 10 is: https://www.daft.ie/dublin-city/property-for-sale/dublin-4/?offset=180
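The pattern in these three URLs can be written down as a small function. This is only a sketch: the names `BASE_URL`, `LISTINGS_PER_PAGE` and `page_url` are made up for illustration, and it assumes every page holds exactly 20 listings, as described above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.daft.ie/dublin-city/property-for-sale/dublin-4/"
LISTINGS_PER_PAGE = 20  # assumption taken from the question: 20 listings per page


def page_url(page: int) -> str:
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * LISTINGS_PER_PAGE
    # The first page carries no query string at all.
    if offset == 0:
        return BASE_URL
    return BASE_URL + "?" + urlencode({"offset": offset})
```

For example, `page_url(2)` ends in `?offset=20` and `page_url(10)` ends in `?offset=180`.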

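I have checked the indentation and it looks fine, but the spider only returns the first page of 20 listings. This is the spider.py file; any suggestions are much appreciated.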

import scrapy


class DaftieSpiderSpider(scrapy.Spider):
    name = 'daftie_spider'
    page_number = 20
    allowed_domains = ['https://www.daft.ie/dublin-city/property-for-sale/dublin-4/']
    start_urls = ['https://www.daft.ie/dublin-city/property-for-sale/dublin-4/']

    def parse(self, response):
        listings = response.xpath('//div[@class="PropertyCardContainer__container"]')
        for listing in listings:
            price = listing.xpath('.//a/strong[@class="PropertyInformationCommonStyles__costAmountCopy"]/text()').extract_first()
            address = listing.xpath('.//*[@class="PropertyInformationCommonStyles__addressCopy--link"]/text()').extract_first()
            bedrooms = listing.xpath('.//*[@class="QuickPropertyDetails__iconCopy"]/text()').extract_first()
            bathrooms = listing.xpath('.//*[@class="QuickPropertyDetails__iconCopy--WithBorder"]/text()').extract_first()
            prop_type = listing.xpath('.//*[@class="QuickPropertyDetails__propertyType"]/text()').extract_first()
            agent = listing.xpath('.//div[@class="BrandedHeader__agentLogoContainer"]/img/@alt').extract_first()

            yield{'price': price,
                  'address': address,
                  'bedrooms': bedrooms,
                  'bathrooms': bathrooms,
                  'prop_type': prop_type,
                  'agent': agent}

            next_page = 'https://www.daft.ie/dublin-city/property-for-sale/dublin-4/?offset=' + str(DaftieSpiderSpider.page_number)
            if DaftieSpiderSpider.page_number <= 180:
                DaftieSpiderSpider.page_number += 20
                yield response.follow(next_page, callback=self.parse)

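2 Answers:

Answer 0 (score: 1)

I am not sure whether it is just a formatting issue in the post, but you are incrementing the value by 20 inside the listings loop. In any case, I would avoid mutating a class variable like this.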
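To see what incrementing inside the loop does, here is a minimal sketch in plain Python (no Scrapy involved). The function name `offsets_requested_from_first_page` is hypothetical; the logic copies the question's counter handling, assuming the first page really contains 20 listings.

```python
def offsets_requested_from_first_page(listings_per_page=20, step=20, last_offset=180):
    """Mimic the question's parse(): the shared counter is bumped once per listing."""
    page_number = step  # plays the role of the class attribute page_number
    requested = []
    for _ in range(listings_per_page):
        next_offset = page_number
        if page_number <= last_offset:
            page_number += step
            requested.append(next_offset)  # stands in for response.follow(...)
    return requested, page_number


requested, counter = offsets_requested_from_first_page()
print(requested)  # [20, 40, 60, 80, 100, 120, 140, 160, 180]
print(counter)    # 200
```

Under these assumptions the first response alone queues all nine follow-up offsets and leaves the counter at 200, so the pagination no longer advances one page per response as intended.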

The following worked better for me:

import scrapy


class DaftieSpiderSpider(scrapy.Spider):
    name = 'daftie_spider'
    page_number = 20
    allowed_domains = ['daft.ie']
    start_urls = ['https://www.daft.ie/dublin-city/property-for-sale/dublin-4/']

    def parse(self, response):
        offset = response.meta.get('offset', 0)
        listings = response.xpath('//div[@class="PropertyCardContainer__container"]')
        for listing in listings:
            price = listing.xpath('.//a/strong[@class="PropertyInformationCommonStyles__costAmountCopy"]/text()').extract_first()
            address = listing.xpath('.//*[@class="PropertyInformationCommonStyles__addressCopy--link"]/text()').extract_first()
            bedrooms = listing.xpath('.//*[@class="QuickPropertyDetails__iconCopy"]/text()').extract_first()
            bathrooms = listing.xpath('.//*[@class="QuickPropertyDetails__iconCopy--WithBorder"]/text()').extract_first()
            prop_type = listing.xpath('.//*[@class="QuickPropertyDetails__propertyType"]/text()').extract_first()
            agent = listing.xpath('.//div[@class="BrandedHeader__agentLogoContainer"]/img/@alt').extract_first()

            yield{'price': price,
                  'address': address,
                  'bedrooms': bedrooms,
                  'bathrooms': bathrooms,
                  'prop_type': prop_type,
                  'agent': agent}

        if offset <= 180:
            offset += 20
            next_page = 'https://www.daft.ie/dublin-city/property-for-sale' \
                        '/dublin-4/?offset=' + str(offset)
            yield response.follow(next_page,
                                  callback=self.parse,
                                  meta={'offset': offset})
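One change in the code above is easy to miss: `allowed_domains` now holds the bare domain `'daft.ie'` instead of a full URL. The Scrapy documentation describes `allowed_domains` as a list of domains, not URLs, and with a URL in that list the offsite filtering is likely to drop every followed request, which would match the symptom in the question. A minimal sketch of reducing a URL to such a domain, where the helper name `domain_for_allowed_domains` is an assumption for illustration and not part of Scrapy:

```python
from urllib.parse import urlparse


def domain_for_allowed_domains(url: str) -> str:
    """Reduce a full URL to a bare domain suitable for allowed_domains."""
    host = urlparse(url).hostname or ""
    # Dropping a leading "www." gives the registered domain for simple hosts.
    return host[4:] if host.startswith("www.") else host


start_url = "https://www.daft.ie/dublin-city/property-for-sale/dublin-4/"
print(domain_for_allowed_domains(start_url))  # daft.ie
```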

Answer 1 (score: 0)

The final code that worked. Thank you very much for your help.
