Question

I need to reprocess previously downloaded websites without downloading them again.

So I want to create multiple scrapy.Response objects without generating any scrapy.Request.

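These responses should be processed first, before any new ones are downloaded.
Let's assume the response contents (url, body, etc.) are loaded from somewhere.
The built-in HTTP cache is not suitable because it needs Requests...
I don't want the spider to have to take care of this.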
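
To make the "loaded from somewhere" assumption concrete, here is a minimal sketch that reads stored pages from a JSON Lines file. The file format, the `Snapshot` tuple, and the `load_snapshots` function are all hypothetical names chosen for illustration; none of them are part of Scrapy, and any other storage would work just as well.

```python
import json
from pathlib import Path
from typing import Iterator, NamedTuple


class Snapshot(NamedTuple):
    """Hypothetical container for one stored page (not a Scrapy class)."""
    url: str
    body: bytes


def load_snapshots(path: Path) -> Iterator[Snapshot]:
    """Yield one Snapshot per non-empty line of a JSON Lines file.

    Assumed line format: {"url": "...", "body": "..."}.
    """
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Response bodies are bytes, so encode the stored text.
            yield Snapshot(record["url"], record["body"].encode("utf-8"))
```

Each loaded snapshot could then be turned into a response with Response(snapshot.url, body=snapshot.body), in the same way as the hard-coded responses in the example in this question.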

Perhaps an extension could do the injection; a middleware might work too. Minimal example:

from scrapy import signals
from scrapy.http import Response


class ReprocessSnapshotsOnSpiderOpenExtension:

    def __init__(self, crawler):
        self.crawler = crawler
        # Run the handler once, as soon as the spider has been opened.
        crawler.signals.connect(
            self.send_the_existing_snapshots_as_new_response,
            signal=signals.spider_opened,
        )

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def send_the_existing_snapshots_as_new_response(self, spider):
        print("##### now in ReprocessSnapshotsOnSpiderOpenExtension.send_the_existing_snapshots_as_new_response()")

        # Responses built directly, with no Request behind them.
        response1 = Response("http://the_url_of_resp1", body=b"the body of resp1")
        response2 = Response("http://the_url_of_resp2", body=b"the body of resp2")
        # ....
        responseN = Response("http://the_url_of_respN", body=b"the body of respN")

        # inject_response_somehow is a placeholder: it does not exist yet.
        inject_response_somehow(response1)
        inject_response_somehow(response2)
        # ...
        inject_response_somehow(responseN)

So the question is: how can inject_response_somehow(...) be implemented?

Is it possible to control where (before/between/after which downloader middlewares/spider middlewares) the responses are injected?

How can Scrapy Responses be created without Requests?

0 answers: