Scrapy ignores the start page and goes straight to the next page

Date: 2017-09-17 18:23:35

Tags: python web-scraping scrapy

I have a Scrapy spider in which I'm trying to implement pagination, but every time I start a crawl it seems to skip the starting page (page 1) and go straight to page 2.

class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
        #self.logger.info("base url %s", get_base_url(response))
        items = []
        self.logger.info("Visited Outer Link %s", response.url)

        for loop in response.xpath('//div[@class="col-md-11"]'):
            item = JobsItems()
            t = loop.xpath('./div[@class="col-xs-12 col-md-3 px-0"]/div[@class="posted-date text-muted hidden-sm-down"]//text()').extract()[1].strip()

            # ... more code here

1 Answer:

Answer 0: (score: 1)

Yes, that behavior is expected: when you use start_urls, the first response goes to the parse method, which CrawlSpider defines internally to apply the crawl rules. Because of that, your parse_item callback is never invoked for the very first page. If you also need to process that first response, you can do something like the following:

class IT(CrawlSpider):
    name = 'IT'

    allowed_domains = ["jobscentral.com.sg"]
    start_urls = [
        'https://jobscentral.com.sg/jobs-accounting',
    ]
    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    first_response = True

    def parse(self, response):
        if self.first_response:
            # handle the first page here, or pass it to some other function
            for r in self.parse_item(response):
                yield r
            self.first_response = False

        # Pass the response on to CrawlSpider so the rules are applied
        for r in super(IT, self).parse(response):
            yield r


    def parse_item(self, response):

        self.logger.info("Response %d for %r" % (response.status, response.url))
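
As a side note, CrawlSpider also exposes a parse_start_url() hook that is called for each response from start_urls, so you can often avoid overriding parse and the first_response flag entirely. A minimal sketch of that alternative (not part of the original answer, reusing the same spider names):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["jobscentral.com.sg"]
    start_urls = ['https://jobscentral.com.sg/jobs-accounting']

    rules = (
        Rule(LinkExtractor(allow_domains=("jobscentral.com.sg",),
                           restrict_xpaths=('//li[@class="page-item"]/a[@aria-label="Next"]',)),
             callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # Called by CrawlSpider for each start_urls response;
        # reuse the regular item callback for page 1
        return self.parse_item(response)

    def parse_item(self, response):
        self.logger.info("Response %d for %r" % (response.status, response.url))
        # ... extraction logic as in the question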