Initialize a Scrapy setting with a value supplied by the user as an argument

Time: 2019-02-18 12:35:09

Tags: scrapy

I want to set the HTTPCACHE_DIR setting to a value that the user provides through a custom argument.

1 answer:

Answer 0: (score: 0)

By default, Scrapy's HttpCacheMiddleware uses FilesystemCacheStorage, which reads the HTTPCACHE_DIR setting:

class FilesystemCacheStorage(object):

    def __init__(self, settings):
        self.cachedir = data_path(settings['HTTPCACHE_DIR'])
        self.expiration_secs = settings.getint('HTTPCACHE_EXPIRATION_SECS')
        self.use_gzip = settings.getbool('HTTPCACHE_GZIP')
        self._open = gzip.open if self.use_gzip else open

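As you can see, Scrapy reads the HTTPCACHE_DIR setting only once, when it creates the FilesystemCacheStorage object. Even if you somehow change the HTTPCACHE_DIR setting later, cachedir will not change.
The only way to change cachedir during the crawl is to change the cachedir attribute of the FilesystemCacheStorage object. You can implement this in your spider code:
(invoked as: scrapy crawl myspider -a HTTPCACHE_DIR="cache_dir")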
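The effect can be reproduced outside Scrapy. The following is a minimal sketch using a hypothetical stand-in class (FakeStorage is not part of Scrapy) that copies the setting in its constructor the same way:

```python
# FakeStorage is a hypothetical stand-in, not Scrapy's own class.
# It copies the setting once, at construction time.
class FakeStorage:
    def __init__(self, settings):
        self.cachedir = settings["HTTPCACHE_DIR"]


settings = {"HTTPCACHE_DIR": "httpcache"}
storage = FakeStorage(settings)

# Changing the setting afterwards does not reach the stored copy.
settings["HTTPCACHE_DIR"] = "other_dir"
stale_cachedir = storage.cachedir
```

Here stale_cachedir still holds the value that was in the settings when the object was built.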

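import scrapy
from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
from scrapy.utils.project import data_path

class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        # -a HTTPCACHE_DIR=... becomes a spider attribute; it is absent if not passed
        cache_dir = getattr(self, "HTTPCACHE_DIR", None)
        if cache_dir:
            # Select downloader middlewares
            downloader_middlewares = self.crawler.engine.downloader.middleware.middlewares
            # Select HttpCacheMiddleware (only in the list when HTTP caching is enabled)
            cache_middleware = next((m for m in downloader_middlewares if isinstance(m, HttpCacheMiddleware)), None)
            if cache_middleware is not None:
                # Change cachedir
                cache_middleware.storage.cachedir = data_path(cache_dir)
        # Schedule the spider's requests as usual
        for url in self.start_urls:
            yield scrapy.Request(url)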
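
For context on where the spider attribute comes from: Scrapy passes each -a NAME=VALUE pair to the spider's constructor as a keyword argument, and the base Spider stores them as instance attributes. The sketch below imitates that with a hypothetical stand-in class (FakeSpider is not Scrapy's Spider) to show that the attribute does not exist when the argument is omitted, so reading it with a default is the safer pattern:

```python
class FakeSpider:
    """Hypothetical stand-in for how a spider receives -a arguments."""

    def __init__(self, **kwargs):
        # Each -a NAME=VALUE pair arrives as a keyword argument
        # and is stored as an instance attribute.
        self.__dict__.update(kwargs)


with_arg = FakeSpider(HTTPCACHE_DIR="cache_dir")
without_arg = FakeSpider()

# getattr with a default avoids an AttributeError when -a was not given.
dir_when_passed = getattr(with_arg, "HTTPCACHE_DIR", None)
dir_when_missing = getattr(without_arg, "HTTPCACHE_DIR", None)
```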