I have a Scrapy spider and I am trying to do pagination, but every time I start the crawl it seems to skip the starting page (page 1) and goes straight to page 2.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

# Assumption: JobsItems is defined in the project's items module
from myproject.items import JobsItems


class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
        #self.logger.info("base url %s", get_base_url(response))
        items = []
        self.logger.info("Visited Outer Link %s", response.url)
        for loop in response.xpath('//div[@class="col-md-11"]'):
            item = JobsItems()
            t = loop.xpath('./div[@class="col-xs-12 col-md-3 px-0"]/div[@class="posted-date text-muted hidden-sm-down"]//text()').extract()[1].strip()
            # ... more code here
Answer 0 (score: 1)

Yes, that is correct: when you use start_urls, the first response goes to the parse method. CrawlSpider defines this method internally to execute the crawling rules, so the starting page never reaches your parse_item callback. If you also need to process that first response, you can override parse like this:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    first_response = True

    def parse(self, response):
        if self.first_response:
            # Use the first response here, or pass it to some other function;
            # parse_item must yield or return an iterable for this to work
            for r in self.parse_item(response):
                yield r
            self.first_response = False
        # Hand the response back to CrawlSpider so the rules are still applied
        for r in super(IT, self).parse(response):
            yield r

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
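As a side note not in the original answer: CrawlSpider also exposes a parse_start_url hook that is invoked for every response coming from start_urls, which lets you handle page 1 without overriding parse at all. A minimal sketch of that approach, reusing the spider's existing parse_item:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["jobscentral.com.sg"]
    start_urls = ['https://jobscentral.com.sg/jobs-accounting']

    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider calls this for each start_urls response; by default it
        # returns an empty list, which is why page 1 appears to be skipped.
        return self.parse_item(response)

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
        # ... extraction logic as in the question

Overriding parse on a CrawlSpider is easy to get wrong (the rules stop running if you forget to delegate to the parent implementation), so parse_start_url is the hook the class provides for exactly this case.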