Question

I have a Scrapy spider that worked as expected for a while, but now it returns empty responses.

import scrapy


class BossSpider(scrapy.Spider):
    name = 'bossaz'
    allowed_domains = ['boss.az']
    start_urls = ['https://boss.az/vacancies']

    def parse(self, response):
        # Follow every vacancy link on the listing page.
        for href in response.xpath('//a[@class="results-i-link"]/@href'):
            yield response.follow(href, self.parse_jobs)

        # Follow the pagination link, if there is one.
        next_page = response.xpath('//span[@class="next"]/a[@rel="next"]/@href').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_jobs(self, response):
        # Collect the fields of a single vacancy page.
        scraped_data = dict()
        scraped_data['job_title'] = response.xpath('//h1[@class="post-title"]/text()').extract_first()
        scraped_data['employer'] = response.xpath('//a[@class="post-company"]/text()').extract_first()
        scraped_data['published'] = response.xpath('//div[@class="bumped_on params-i-val"]/text()').extract_first()
        scraped_data['details'] = response.xpath('//div[@class="post-cols post-info"]').extract()
        yield scraped_data

When I run the spider on my machine, the code above now returns the following stats:

{'downloader/request_bytes': 431,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 304,
 'downloader/response_count': 2,
 'downloader/response_status_count/204': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 30, 5, 30, 18, 860994),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'memusage/max': 53403648,
 'memusage/startup': 53403648,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 8, 30, 5, 30, 17, 554091)}

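I also tried fetching the page from the terminal with scrapy shell https://boss.az/vacancies. There, response.body is an empty string as well. Note that I checked the site's HTML and its structure has not changed. Why does the spider get HTTP status 204?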
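
For reading the stats above: 204 is a 2xx status, so Scrapy should still hand the response to parse, but the body is empty and the XPath selectors have nothing to match, which would explain a crawl that finishes without items or follow-up requests. The following is a minimal sketch for checking that from a stats dict, assuming the dict has the shape Scrapy printed above. The helper names summarize_status_codes and only_empty_responses are hypothetical and not part of Scrapy, and the sketch only confirms the symptom; it does not explain why the server answers with 204.

```python
def summarize_status_codes(stats):
    """Return {status_code: count} from a Scrapy-style stats dict."""
    prefix = 'downloader/response_status_count/'
    return {
        int(key[len(prefix):]): count
        for key, count in stats.items()
        if key.startswith(prefix)
    }


def only_empty_responses(stats):
    """True when at least one response was seen and all were 204 No Content."""
    counts = summarize_status_codes(stats)
    return bool(counts) and set(counts) == {204}


# A subset of the stats shown in the question.
stats = {
    'downloader/request_count': 2,
    'downloader/response_bytes': 304,
    'downloader/response_count': 2,
    'downloader/response_status_count/204': 2,
    'response_received_count': 2,
}

print(summarize_status_codes(stats))  # {204: 2}
print(only_empty_responses(stats))    # True
```

With the stats from the question, every response received was a 204, so no request ever returned HTML for the selectors to work on.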

Scrapy Response 204 No Content

0 answers: