I'm trying to get Scrapy to follow the next page and keep crawling, but it just stops once the spider reaches the end of the page. Here is a snippet of my code:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
    )
Any idea how to get it working? Previously I used the code below and pagination worked, but it always stopped at page 7:
next_page = response.xpath('//*[(@id = "page_next")]/@href')
if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url, self.parse)
Edit:
2017-09-09 15:48:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 15:48:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 16626,
'downloader/request_count': 20,
'downloader/request_method_count/GET': 20,
'downloader/response_bytes': 197475,
'downloader/response_count': 20,
'downloader/response_status_count/200': 20,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 9, 7, 48, 35, 66598),
'item_scraped_count': 19,
'log_count/DEBUG': 40,
'log_count/INFO': 28,
'log_count/WARNING': 2,
'memusage/max': 45748224,
'memusage/startup': 32595968,
'request_depth_max': 1,
'response_received_count': 20,
'scheduler/dequeued': 20,
'scheduler/dequeued/memory': 20,
'scheduler/enqueued': 20,
'scheduler/enqueued/memory': 20,
'start_time': datetime.datetime(2017, 9, 9, 7, 47, 1, 843551)}
2017-09-09 15:48:35 [scrapy.core.engine] INFO: Spider closed (finished)
Exception twisted._threads._ithreads.AlreadyQuit: AlreadyQuit() in <bound method JobstreetPipeline.__del__ of <jobstreet.pipelines.JobstreetPipeline object at 0x103c152d0>> ignored
Current code:
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        # Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//*[@id="page_next"]',)), callback="self.parse", follow=True),
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse', follow=True),
    )

    def parse(self, response):
        items = []
        ...
        item = JobsItems()
        ...
        item['jobdetailsurl'] = sel.xpath('.//a[@class="position-title-link"]/@href').extract()[0]
        request = scrapy.Request(item['jobdetailsurl'], callback=self.parse_jobdetails)
        request.meta['item'] = item
        yield request
Answer 0 (score: 1)
A couple of issues. SgmlLinkExtractor is deprecated; you should use LinkExtractor instead.
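If it helps, the corresponding import in recent Scrapy versions looks like this (a one-line sketch, not part of the original answer):

from scrapy.linkextractors import LinkExtractor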
callback="self.parse"
应该是没有self的函数名。如果你想从响应中提取数据,那么你应该使用另一个函数
class IT(CrawlSpider):
    name = 'IT'
    allowed_domains = ["www.jobstreet.com.sg"]
    start_urls = [
        'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?ojs=10&key=it',
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=("jobstreet.com.sg", ), restrict_xpaths=('//*[@id="page_next"]',)), callback='parse_response', follow=True),
    )

    def parse_response(self, response):
        yield {"page": response.url}
Edit 2:
I also added response.body to the item yield, and in the last response got the following:
<script src="https://www.google.com/recaptcha/api.js?hl="></script>\n\n </head>\n\n\t<body>\n\n <div id="main" class="main">\n <div class="captchaDiv">\n\t\t\t\t<div id="headerDiv" class="headerDiv">\n\t\t\t\t\t<div id="insideHeaderDiv" class="insideHeaderDiv"></div>\n\t\t\t\t</div>\n\t\t\t\t<div class="separator"></div>\n\t\t\t\t<div id="mainDiv" class="mainDiv">\n\t\t\t\t\t<div id="titleText" class="titleText">Security Check</div>\n\t\t\t\t\t<div id="instructionsText" class="instructionsText">Before we allow your access to this page, we need to confirm if you are a human (it\'s a spam prevention thing)</div>\n\t\t\t\t\t<div class="g-recaptcha" data-sitekey=\'6LcX6A4UAAAAAKK1WiuMtXOj6Ib-lXZwVaWGvkq6\' data-callback=\'mprv_captcha_submitUserInput\'></div>\n\t\t\t\t\t<div id="footerText" class="footerText">SPAM Prevention</div>\n\t\t\t\t</div>\n </div>\n </div>\n\t</body>\n</html>'
So the page shows a captcha after a few requests, which is why the crawl stops. You will need to work around that by slowing down your requests, solving the captcha somehow, or using a service like Crawlera.
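As a starting point for slowing things down, here is a minimal sketch of throttling options in settings.py (the values are illustrative assumptions, not tuned for this site):

# settings.py -- illustrative values, tune for your own crawl
DOWNLOAD_DELAY = 5                   # pause several seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay so requests look less robotic
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain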
Edit 1:
Output while crawling:
2017-09-09 13:10:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16> (referer: https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=15&src=16&srcr=16)
2017-09-09 13:10:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16>
{'page': 'https://www.jobstreet.com.sg/en/job-search/job-vacancy.php?key=it&area=1&option=1&job-source=1%2C64&classified=1&job-posted=0&sort=2&order=0&pg=16&src=16&srcr=16'}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-09 13:10:06 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15664,
'downloader/request_count': 16,
'downloader/request_method_count/GET': 16,
'downloader/response_bytes': 306524,
'downloader/response_count': 16,
'downloader/response_status_count/200': 16,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 9, 7, 40, 6, 294685),
'item_scraped_count': 15,
'log_count/DEBUG': 31,
'log_count/INFO': 7,
'memusage/max': 49135616,
'memusage/startup': 49135616,
'request_depth_max': 15,
'response_received_count': 16,
'scheduler/dequeued': 16,
'scheduler/dequeued/memory': 16,
'scheduler/enqueued': 16,
'scheduler/enqueued/memory': 16,
'start_time': datetime.datetime(2017, 9, 9, 7, 39, 56, 380899)}
2017-09-09 13:10:06 [scrapy.core.engine] INFO: Spider closed (finished)
Answer 1 (score: 0)
You can use the following code for pagination.
next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()
if next_page:
    yield scrapy.Request(next_page, self.parse)
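Note that this passes the extracted href straight to scrapy.Request, which only works if the link is absolute; if the site emits a relative href, joining it against the current URL (as in your earlier snippet) is safer. A small sketch of that variant, under the same XPath assumption:

next_page = response.xpath('//*[@id="page_next"]/@href').extract_first()
if next_page:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)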