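How can a Spider skip an iteration when the web page contains certain data?
The pages have titles like the ones below. I have left out the other data (date, likes).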
page 1 title: 'We like cats' # this title is valid
page 2 title: 'This title contains WORD X...' # this title is not valid (skip it)
page 3 title: 'Best ideas' # this title is valid

import scrapy


class Carflix(scrapy.Spider):
    # A plain Spider is used because CrawlSpider reserves parse() for its own rule handling.
    name = 'carflix'
    allowed_domains = ['sitex.com']
    start_urls = ['http://sitex.com/page-1.html',
                  'http://sitex.com/page-2.html',
                  'http://sitex.com/page-3.html']

    def parse(self, response):
        date = response.xpath('//div[@class="date"]/text()').extract_first()
        pagetitle = response.xpath('//div[@class="title"]/text()').extract_first()
        if 'WORD X' in pagetitle:
            # What should go here to skip this page when its title contains 'WORD X'?
            pass
        likes = response.xpath('//div[@class="likes"]/text()').extract_first()
        yield {
            'pagetitle': pagetitle,
            'date': date,
            'likes': likes,
        }

Expected output:

[{
    'pagetitle': 'We like cats',
    'date': '01/01/2019',
    'likes': 200
},
{
    'pagetitle': 'Best ideas',
    'date': '02/01/2019',
    'likes': 100
}]

Answer 0 (score: 1)
Yield the item only when the condition is met:

def parse(self, response):
    date = response.xpath('//div[@class="date"]/text()').extract_first()
    # default='' keeps the membership test below safe when the title div is missing
    pagetitle = response.xpath('//div[@class="title"]/text()').extract_first(default='')
    likes = response.xpath('//div[@class="likes"]/text()').extract_first()
    # Yield the item only when the title does not contain the blocked phrase
    if 'WORD X' not in pagetitle:
        yield {
            'pagetitle': pagetitle,
            'date': date,
            'likes': likes,
        }
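
An equivalent way to write the same check is an early `return` at the top of the generator, which some people find easier to read when there are several skip conditions. The sketch below is only an illustration: it pulls the logic out of Scrapy so it can be run without a crawl. The names `BLOCKED_PHRASE` and `items_for_page` are hypothetical, and the date and likes values for the skipped page are made up.

```python
BLOCKED_PHRASE = 'WORD X'  # hypothetical constant, not part of the original code


def items_for_page(pagetitle, date, likes):
    """Yield one item for a page, or nothing if the title is missing or blocked."""
    if pagetitle is None or BLOCKED_PHRASE in pagetitle:
        # A bare return inside a generator ends it without yielding anything,
        # which is what "skipping" a page amounts to.
        return
    yield {
        'pagetitle': pagetitle,
        'date': date,
        'likes': likes,
    }


# Sample data mirroring the three pages in the question.
# The date and likes of the second page are invented for this example.
pages = [
    ('We like cats', '01/01/2019', '200'),
    ('This title contains WORD X...', '01/01/2019', '150'),
    ('Best ideas', '02/01/2019', '100'),
]

results = [item for page in pages for item in items_for_page(*page)]
print([item['pagetitle'] for item in results])
```

Inside a spider, the same guard would sit in `parse` right after `pagetitle` is extracted and before the `yield`.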