I'm trying to get Scrapy to follow the next page and keep crawling, but it just stops once the spider reaches the end of the page. Here is a snippet of my code:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
    )
Any idea how to get it working? Previously I used the code below and pagination worked, but it always stopped at page 7:
next_page = response.xpath('//*[(@id = "page_next")]/@href')
if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url, self.parse)
Edit:
2017-09-09 15:48:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 15:48:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16626,
'downloader/request_count': 20,
'downloader/request_method_count/GET': 20,
'downloader/response_bytes': 197475,
'downloader/response_count': 20,
'downloader/response_status_count/200': 20,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 9, 7, 48, 35, 66598),
'item_scraped_count': 19,
'log_count/DEBUG': 40,
'log_count/INFO': 28,
'log_count/WARNING': 2,
'memusage/max': 45748224,
'memusage/startup': 32595968,
'request_depth_max': 1,
'response_received_count': 20,
'scheduler/dequeued': 20,
'scheduler/dequeued/memory': 20,
'scheduler/enqueued': 20,
'scheduler/enqueued/memory': 20,
'start_time': datetime.datetime(2017, 9, 9, 7, 47, 1, 843551)}
2017-09-09 15:48:35 [scrapy.core.engine] INFO: Spider closed (finished)
Exception twisted._threads._ithreads.AlreadyQuit: AlreadyQuit() in <bound method JobstreetPipeline.__del__ of <jobstreet.pipelines.JobstreetPipeline object at 0x103c152d0>> ignored
Current code:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        # Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse', follow=True),
    )

    def parse(self, response):
        items = []
        ...
        item = JobsItems()
        ...
        item['jobdetailsurl'] = sel.xpath('.//a[@class="position-title-link"]/@href').extract()[0]
        request = scrapy.Request(item['jobdetailsurl'], callback=self.parse_jobdetails)
        request.meta['item'] = item
        yield request
Answer 0 (score: 1)
A couple of issues. SgmlLinkExtractor is deprecated; you should use LinkExtractor instead.
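If it helps, the corresponding import in recent Scrapy versions looks like this (a one-line sketch, not part of the original answer):

from scrapy.linkextractors import LinkExtractor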
callback="self.parse"
应该是没有self的函数名。如果你想从响应中提取数据,那么你应该使用另一个函数
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse_response', follow=True),
    )

    def parse_response(self, response):
        yield {"page": response.url}
Edit 2:
I also added response.body to the item yield, and in the last response got the following:
<script src="https://www.google.com/recaptcha/api.js?hl="></script>\n\n </head>\n\n\t<body>\n\n <div id="main" class="main">\n <div class="captchaDiv">\n\t\t\t\t<div id="headerDiv" class="headerDiv">\n\t\t\t\t\t<div id="insideHeaderDiv" class="insideHeaderDiv"></div>\n\t\t\t\t</div>\n\t\t\t\t<div class="separator"></div>\n\t\t\t\t<div id="mainDiv" class="mainDiv">\n\t\t\t\t\t<div id="titleText" class="titleText">Security Check</div>\n\t\t\t\t\t<div id="instructionsText" class="instructionsText">Before we allow your access to this page, we need to confirm if you are a human (it\'s a spam prevention thing)</div>\n\t\t\t\t\t<div class="g-recaptcha" data-sitekey=\'6LcX6A4UAAAAAKK1WiuMtXOj6Ib-lXZwVaWGvkq6\' data-callback=\'mprv_captcha_submitUserInput\'></div>\n\t\t\t\t\t<div id="footerText" class="footerText">SPAM Prevention</div>\n\t\t\t\t</div>\n </div>\n </div>\n\t</body>\n</html>'
So the page shows a captcha after a few requests, which is why the crawl stops. You will need to work around that by slowing down your requests, solving the captcha somehow, or using a service like Crawlera.
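As a starting point for slowing things down, here is a minimal sketch of throttling options in settings.py (the values are illustrative assumptions, not tuned for this site):

# settings.py -- illustrative values, tune for your own crawl
DOWNLOAD_DELAY = 5                   # pause several seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay so requests look less robotic
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain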
Edit 1:
Output while crawling:
2017-09-09 13:10:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16> (referer: https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=15&src=16&srcr=16)
2017-09-09 13:10:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16>
{'page': 'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16'}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 13:10:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15664,
'downloader/request_count': 16,
'downloader/request_method_count/GET': 16,
'downloader/response_bytes': 306524,
'downloader/response_count': 16,
'downloader/response_status_count/200': 16,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 9, 7, 40, 6, 294685),
'item_scraped_count': 15,
'log_count/DEBUG': 31,
'log_count/INFO': 7,
'memusage/max': 49135616,
'memusage/startup': 49135616,
'request_depth_max': 15,
'response_received_count': 16,
'scheduler/dequeued': 16,
'scheduler/dequeued/memory': 16,
'scheduler/enqueued': 16,
'scheduler/enqueued/memory': 16,
'start_time': datetime.datetime(2017, 9, 9, 7, 39, 56, 380899)}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Spider closed (finished)
Answer 1 (score: 0)
You can use the following code for pagination.
next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()
if next_page:
    yield scrapy.Request(next_page, self.parse)
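Note that this passes the extracted href straight to scrapy.Request, which only works if the link is absolute; if the site emits a relative href, joining it against the current URL (as in your earlier snippet) is safer. A small sketch of that variant, under the same XPath assumption:

next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()
if next_page:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)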