I need to re-process previously downloaded websites without downloading them again.
So I want to create several scrapy.Response objects without issuing any scrapy.Request.
Maybe an extension can do this; a middleware might work as well. Minimal example:
from scrapy import signals
from scrapy.http import Response


class ReprocessSnapshotsOnSpiderOpenExtension(object):

    def __init__(self, crawler):
        self.crawler = crawler
        # As soon as the spider opens, feed the stored snapshots back in.
        crawler.signals.connect(
            self.send_the_existing_snapshots_as_new_responses,
            signal=signals.spider_opened,
        )

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def send_the_existing_snapshots_as_new_responses(self, spider):
        print("##### now in ReprocessSnapshotsOnSpiderOpenExtension.send_the_existing_snapshots_as_new_responses()")
        response1 = Response("http://the_url_of_resp1", body=b"the body of resp1")
        response2 = Response("http://the_url_of_resp2", body=b"the body of resp2")
        # ....
        responseN = Response("http://the_url_of_respN", body=b"the body of respN")

        inject_response_somehow(response1)
        inject_response_somehow(response2)
        # ...
        inject_response_somehow(responseN)
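For completeness, I enable the extension via the EXTENSIONS setting in settings.py (the module path and the priority value 500 are just my own choices, assuming the class lives in myproject/extensions.py):

EXTENSIONS = {
    # adjust the dotted path to wherever the extension class is defined
    "myproject.extensions.ReprocessSnapshotsOnSpiderOpenExtension": 500,
}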
So the question is: how can inject_response_somehow(...) be implemented?
Is it also possible to control where the response gets injected, i.e. before/between/after the downloader middlewares and the spider middlewares?
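For reference, the closest I have come is a downloader middleware instead of an extension: if process_request returns a Response, Scrapy skips the actual download and passes that response through the process_response hooks and the spider middlewares to the callback. This is only a sketch, not what I actually want: load_snapshot_for is a hypothetical helper that looks up a stored body for a URL, and this approach still forces me to issue dummy Requests for every snapshot URL.

from scrapy.http import HtmlResponse


class SnapshotDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # load_snapshot_for() is a hypothetical helper that returns the
        # previously saved body bytes for this URL, or None if none exists.
        body = load_snapshot_for(request.url)
        if body is not None:
            # Returning a Response from process_request makes Scrapy skip
            # the download; the response then flows through the remaining
            # downloader middlewares' process_response and through the
            # spider middlewares before reaching the spider callback.
            return HtmlResponse(url=request.url, body=body, encoding="utf-8")
        return None  # fall through to a normal download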