I have a Scrapy spider that worked as expected for a while, but it now returns empty responses.
import scrapy


class BossSpider(scrapy.Spider):
    name = 'bossaz'
    allowed_domains = ['boss.az']
    start_urls = ['https://boss.az/vacancies']

    def parse(self, response):
        # Follow every vacancy link on the listing page.
        for href in response.xpath('//a[@class="results-i-link"]/@href'):
            yield response.follow(href, self.parse_jobs)

        # Follow the pagination link, if there is one.
        next_page = response.xpath('//span[@class="next"]/a[@rel="next"]/@href').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_jobs(self, response):
        scraped_data = dict()
        scraped_data['job_title'] = response.xpath('//h1[@class="post-title"]/text()').extract_first()
        scraped_data['employer'] = response.xpath('//a[@class="post-company"]/text()').extract_first()
        scraped_data['published'] = response.xpath('//div[@class="bumped_on params-i-val"]/text()').extract_first()
        scraped_data['details'] = response.xpath('//div[@class="post-cols post-info"]').extract()
        yield scraped_data
When I run the spider on my machine, the code above now produces the following stats:
{'downloader/request_bytes': 431,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 304,
'downloader/response_count': 2,
'downloader/response_status_count/204': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 8, 30, 5, 30, 18, 860994),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'memusage/max': 53403648,
'memusage/startup': 53403648,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 8, 30, 5, 30, 17, 554091)}
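The key detail in these stats is that both responses came back with status 204. As a minimal sketch, the status counts can be pulled out of a stats dict like this; `status_counts` is a hypothetical helper (not part of Scrapy), and the dict below copies only a few keys from the output above.

```python
STATUS_PREFIX = 'downloader/response_status_count/'


def status_counts(stats):
    """Map HTTP status code -> number of responses, given a Scrapy stats dict."""
    return {
        int(key[len(STATUS_PREFIX):]): count
        for key, count in stats.items()
        if key.startswith(STATUS_PREFIX)
    }


# Subset of the stats shown above.
stats = {
    'downloader/request_count': 2,
    'downloader/response_bytes': 304,
    'downloader/response_count': 2,
    'downloader/response_status_count/204': 2,
    'finish_reason': 'finished',
}

counts = status_counts(stats)
print(counts)  # {204: 2}
```

So no 200 responses were received at all, which matches the empty bodies.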
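I also tried fetching the page from the terminal with scrapy shell https://boss.az/vacancies. In that session, response.body is also an empty string. Note that I checked the site's HTML and its structure has not changed. Why is the spider getting HTTP status 204?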
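One way to narrow this down might be to repeat the request outside Scrapy and compare the status and body length for different request headers. The sketch below uses only the standard library; `build_request` and `fetch_status` are hypothetical helper names, and the header value in the usage comment is an illustrative assumption, not something confirmed to affect this site.

```python
import urllib.request


def build_request(url, headers=None):
    """Build a GET request with optional extra headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=headers or {})


def fetch_status(url, headers=None, timeout=10):
    """Send the request and return (HTTP status, body length in bytes).

    Note: urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    with urllib.request.urlopen(build_request(url, headers), timeout=timeout) as resp:
        return resp.status, len(resp.read())


# Example usage (performs real network requests, so it is left commented out):
# print(fetch_status('https://boss.az/vacancies'))
# print(fetch_status('https://boss.az/vacancies',
#                    headers={'User-Agent': 'example-agent/1.0'}))  # assumed value
```

If the two calls return different statuses, the difference lies in how the server treats the request headers rather than in the spider's XPath logic.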