How do I skip a page during iteration?

Time: 2019-03-28 14:20:07

Tags: python scrapy

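How can I skip one iteration of a Spider when the page contains certain data?

Page titles:

Each page has a title. I have left out the other data (date, likes) here.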

page 1 title: 'We like cats'  # this title is valid
page 2 title: 'This title contains WORD X...'  # this title is not valid (skip it)
page 3 title: 'Best ideas'  # this title is valid

Code:

from scrapy.spiders import CrawlSpider  # the old scrapy.spider module is deprecated

class Carflix(CrawlSpider):
    name = 'carflix'
    allowed_domains = ['sitex.com']
    start_urls = ['http://sitex.com/page-1.html',
                  'http://sitex.com/page-2.html',
                  'http://sitex.com/page-3.html']

    def parse(self, response):
        date = response.xpath('//div[@class="date"]/text()').extract_first()
        pagetitle = response.xpath('//div[@class="title"]/text()').extract_first()
        if 'WORD X' in pagetitle:
            # What should go here to skip this page when the title contains 'WORD X'?
            pass
        likes = response.xpath('//div[@class="likes"]/text()').extract_first()
        yield {
            'pagetitle': pagetitle,
            'date': date,
            'likes': likes,
        }

The result should be:

[{
    'pagetitle': 'We like cats',
    'date': '01/01/2019',
    'likes': 200
},
{
    'pagetitle': 'Best ideas',
    'date': '02/01/2019',
    'likes': 100
}]

1 answer:

Answer 0 (score: 1)

Yield the result only when the condition is met:


def parse(self, response):
    date = response.xpath('//div[@class="date"]/text()').extract_first()
    pagetitle = response.xpath('//div[@class="title"]/text()').extract_first()
    likes = response.xpath('//div[@class="likes"]/text()').extract_first()
    # Yield the item only when the title does not contain 'WORD X'
    if 'WORD X' not in pagetitle:
        yield {
            'pagetitle': pagetitle,
            'date': date,
            'likes': likes,
        }
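
One thing the snippet above does not cover: extract_first() returns None when the XPath matches nothing, and `'WORD X' not in None` raises a TypeError. Below is a minimal sketch of the same idea with an early return. The helper name `build_items` and its `banned` parameter are hypothetical, not part of Scrapy, and skipping pages that have no title at all is an assumption about what you want.

```python
def build_items(pagetitle, date, likes, banned='WORD X'):
    """Yield one item dict, or nothing if the title is missing or banned.

    Hypothetical helper for illustration; it is not part of Scrapy.
    """
    # Assumption: a page with no title is skipped as well. The guard also
    # avoids a TypeError from using `in` on None.
    if pagetitle is None or banned in pagetitle:
        return  # a bare return ends the generator without yielding anything
    yield {
        'pagetitle': pagetitle,
        'date': date,
        'likes': likes,
    }


# Inside the spider, parse() could delegate to the helper:
#
#     def parse(self, response):
#         date = response.xpath('//div[@class="date"]/text()').extract_first()
#         pagetitle = response.xpath('//div[@class="title"]/text()').extract_first()
#         likes = response.xpath('//div[@class="likes"]/text()').extract_first()
#         yield from build_items(pagetitle, date, likes)

# Quick check without Scrapy (the values for the skipped page are made up):
print(list(build_items('We like cats', '01/01/2019', '200')))
print(list(build_items('This title contains WORD X...', '01/01/2019', '0')))
```

Keeping the check in a plain function means it can be tested without running a crawl. Note that the values come back as strings, since extract_first() returns text.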