Question

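Some PDF URLs do not end in ".pdf", so the only way to identify them is to inspect the response headers. I want to avoid downloading such PDFs. In Scrapy it is easy to check the headers once the response has been fully downloaded. How can I fetch and check only the response headers first, and download the body later?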

Answer 1

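Use the HTTP HEAD request method to fetch only the headers. Then check Content-Type and, depending on its value, issue the same request again with the GET method. See this minimal working example: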

# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import scrapy


class DummySpider(scrapy.Spider):
    name = 'dummy'

    def start_requests(self):
        # HEAD asks the server for the headers only, without the body.
        yield scrapy.Request('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf',
                             callback=self.parse_headers, method='HEAD')

    def parse_headers(self, response):
        # Scrapy returns header values as bytes; a missing header yields None.
        content_type = response.headers.get('Content-Type') or b''
        if content_type.startswith(b'application/pdf'):
            # Re-issue the same request as GET to download the body.
            yield response.request.replace(callback=self.parse, method='GET')

    def parse(self, response):
        print(len(response.body))
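
The Content-Type check in parse_headers can also be pulled out into a small standalone function, which makes it easier to handle values that carry parameters (for example "application/pdf; charset=binary") or unusual capitalisation. The following is only a sketch: is_pdf_content_type is a hypothetical helper name, not part of Scrapy, and it assumes the header value arrives as bytes, str, or None.

```python
def is_pdf_content_type(value):
    """Return True if a Content-Type header value denotes a PDF.

    Hypothetical helper (not a Scrapy API). Accepts bytes, str, or None.
    """
    if value is None:
        return False
    if isinstance(value, bytes):
        # latin-1 maps every byte to a character, so decoding never fails.
        value = value.decode('latin-1')
    # Drop parameters such as "; charset=binary" and normalise case.
    media_type = value.split(';', 1)[0].strip().lower()
    return media_type == 'application/pdf'
```

Inside the spider it could be called as is_pdf_content_type(response.headers.get('Content-Type')) in place of the startswith test.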

Checking response headers in Scrapy without downloading the body
