Question

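I am using the latest Scrapy version, v1.3.

I crawl a website page by page, following the URLs in the pagination. On some pages the site detects that I am a bot and returns an error inside the HTML. Because the request itself succeeds, the page gets cached, and when I run the spider again I get the same error.

What I need is a way to prevent that page from entering the cache. If that is not possible, I need to remove it from the cache once I detect the error in the parse method, so that I can retry and get the correct page.

I have a partial solution: I yield all requests with "dont_cache": False in meta, so I make sure they use the cache. Where I detect the error and retry the request, I pass dont_filter=True together with "dont_cache": True to make sure I get a fresh copy of the failing URL.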

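from scrapy import Request, Selector

def parse(self, response):
    # Spider callback: "page" is carried in meta from the previous request.
    page = response.meta["page"] + 1
    html = Selector(response)

    counttext = html.css('h2#s-result-count::text').extract_first()
    if counttext is None:
        # Bot-detection page: undo the increment and re-request the same URL,
        # bypassing both the HTTP cache and the duplicate filter.
        page = page - 1
        yield Request(url=response.url, callback=self.parse, meta={"page": page, "dont_cache": True}, dont_filter=True)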

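I also tried a custom retry middleware. I managed to get it running before the cache, but I could not read response.body successfully. I suspect it is compressed somehow, because it is binary data.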

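import logging

from scrapy import Selector
from scrapy.downloadermiddlewares.retry import RetryMiddleware

logger = logging.getLogger(__name__)

class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        # Debug dump of the raw bytes this middleware receives.
        with open('debug.txt', 'wb') as outfile:
            outfile.write(response.body)
        # response.body is raw bytes here; this is where reading it fails.
        html = Selector(text=response.body)

        url = response.url

        counttext = html.css('h2#s-result-count::text').extract_first()
        if counttext is None:
            logger.info("Automated process error: %s", url)
            reason = 'Automated process error %d' % response.status
            return self._retry(request, reason, spider) or response
        return response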
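One way to test the compression suspicion is to look for the gzip magic bytes at the start of the body. This is a minimal standard-library sketch, not part of the middleware above: `maybe_gunzip` is a hypothetical helper name, and it assumes the encoding is gzip rather than deflate or something else.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def maybe_gunzip(body):
    """Return body decompressed if it looks like gzip, else unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


# Stand-in data: one compressed body, one plain body.
raw = gzip.compress(b'<h2 id="s-result-count">ok</h2>')
print(maybe_gunzip(raw))
print(maybe_gunzip(b'<html>plain</html>'))
```

If the dumped debug.txt starts with those two bytes, the body is most likely gzip-compressed at the point where the middleware reads it.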

Any suggestions are appreciated.

Thanks,

Muhammad

Answer 1

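The middleware responsible for request/response caching is HttpCacheMiddleware. Under the hood it is driven by cache policies: special classes that decide which requests and responses should or should not be cached. You can implement your own cache policy class and use it with the setting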

HTTPCACHE_POLICY = 'my.custom.cache.Class'

More information in the docs: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

Source code of the built-in policies: https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/httpcache.py#L18
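As a rough illustration of the "prevent adding to cache" side, a policy could refuse to store any response that lacks a marker of a genuine page. The sketch below is hypothetical throughout: the class name, the marker bytes (borrowed from the selector in the question), and the stand-in objects are assumptions, and it only mirrors the four method names a cache policy is expected to provide. It also assumes the body reaching the policy is uncompressed, which may not hold if the cache middleware sees responses before they are decompressed.

```python
from types import SimpleNamespace


class SkipBlockedPagesPolicy:
    """Hypothetical cache policy that refuses to store bot-detection pages."""

    # Assumed marker: bytes that only appear on a genuine results page.
    REQUIRED_MARKER = b's-result-count'

    def __init__(self, settings=None):
        # Scrapy passes its settings object here; this sketch ignores it.
        self.settings = settings

    def should_cache_request(self, request):
        return True

    def should_cache_response(self, response, request):
        # Store only successful responses that contain the marker.
        return response.status == 200 and self.REQUIRED_MARKER in response.body

    def is_cached_response_fresh(self, cachedresponse, request):
        return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        return True


# Stand-in objects so the sketch runs without Scrapy installed.
policy = SkipBlockedPagesPolicy()
good = SimpleNamespace(status=200, body=b'<h2 id="s-result-count">1-16 of 200</h2>')
blocked = SimpleNamespace(status=200, body=b'<p>Automated access detected</p>')
print(policy.should_cache_response(good, request=None))     # True
print(policy.should_cache_response(blocked, request=None))  # False
```

With a policy along these lines, the error page never lands in the cache, so a later run fetches the URL again instead of replaying the error.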

Answer 2

Thanks to mizhgun, I managed to develop a solution using a custom policy.

Here is what I did:

from scrapy.utils.httpobj import urlparse_cached

class CustomPolicy(object):
    # Cache policy: cache as usual, but treat the cached copy as stale
    # whenever the request carries "refresh_cache" in its meta.

    def __init__(self, settings):
        self.ignore_schemes = settings.getlist('HTTPCACHE_IGNORE_SCHEMES')
        self.ignore_http_codes = [int(x) for x in settings.getlist('HTTPCACHE_IGNORE_HTTP_CODES')]

    def should_cache_request(self, request):
        return urlparse_cached(request).scheme not in self.ignore_schemes

    def should_cache_response(self, response, request):
        return response.status not in self.ignore_http_codes

    def is_cached_response_fresh(self, response, request):
        # Returning False makes the middleware download the page again.
        if "refresh_cache" in request.meta:
            return False
        return True

    def is_cached_response_valid(self, cachedresponse, response, request):
        # Returning False keeps the freshly downloaded response
        # instead of the cached one.
        if "refresh_cache" in request.meta:
            return False
        return True
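For completeness, the policy has to be switched on in the project settings. The fragment below is a sketch: `HTTPCACHE_ENABLED` and `HTTPCACHE_POLICY` are the documented setting names, while the dotted path `myproject.policies.CustomPolicy` is a hypothetical location that should be replaced with wherever the class actually lives.

```python
# settings.py (sketch)

# Turn the HTTP cache on; it is disabled by default.
HTTPCACHE_ENABLED = True

# Hypothetical dotted path; point it at the module that defines CustomPolicy.
HTTPCACHE_POLICY = 'myproject.policies.CustomPolicy'
```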

When I detect the error (after caching has already happened, of course):

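from scrapy import Request, Selector

def parse(self, response):
    html = Selector(response)

    counttext = html.css('selector').extract_first()
    if counttext is None:
        # Error page: re-request the URL, flagged so the policy bypasses the cache.
        yield Request(url=response.url, callback=self.parse, meta={"refresh_cache": True}, dont_filter=True)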

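When refresh_cache is added to meta, it can be caught in the custom policy class.

Don't forget to add dont_filter=True, otherwise the second request will be dropped by the duplicate filter.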
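One caveat with retrying like this: if the site keeps serving the error page, the spider would re-request the same URL forever. A possible guard, sketched here with plain dictionaries, is to count attempts in meta. The helper name `refresh_meta`, the `refresh_attempts` key, and the limit of 3 are all assumptions for illustration, not part of Scrapy.

```python
MAX_REFRESHES = 3  # assumed limit, tune per site


def refresh_meta(meta, max_refreshes=MAX_REFRESHES):
    """Return meta for a cache-bypassing retry, or None once the limit is hit."""
    attempts = meta.get('refresh_attempts', 0)
    if attempts >= max_refreshes:
        return None
    new_meta = dict(meta)  # copy so the original meta is left untouched
    new_meta['refresh_cache'] = True
    new_meta['refresh_attempts'] = attempts + 1
    return new_meta


print(refresh_meta({'page': 2}))
print(refresh_meta({'page': 2, 'refresh_attempts': 3}))
```

In parse, the result would be passed as meta= for the retry request when it is not None; otherwise the URL could be logged and skipped.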

Scrapy: how to remove a URL from httpcache or prevent adding it to the cache
