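When I enable the following Frontera middlewares in Scrapy, I lose the Referer header on all of my response objects.
Is there any way I can preserve the referer?
When I remove the following lines the referer is available, but I need these Frontera middlewares enabled: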
SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
})
DOWNLOADER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
})
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
In addition, RefererMiddleware is enabled; I can see it in the debug log when Scrapy starts.
Edit: here is the entire content of my settings file:
BOT_NAME = 'crawler'
SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'
USER_AGENT = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36"
DOWNLOAD_DELAY = 2
DUPEFILTER=True
ITEM_PIPELINES = {
    'crawler.pipelines.AllDataPipeline': 300
}
SPIDER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES = {}
RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408]
REFERER_ENABLED = True
######################################################################
# Frontera Settings
#######################################################################
BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'
HTTPCACHE_ENABLED = False
REDIRECT_ENABLED = True
COOKIES_ENABLED = False
DOWNLOAD_TIMEOUT = 20
RETRY_ENABLED = False
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 2
LOGSTATS_INTERVAL = 10
SPIDER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES = {}
SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 699,
})
DOWNLOADER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
})
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
Answer 0 (score: -1)
DUPEFILTER cannot simply be set to True. You can set
DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"
If you want to issue a request that bypasses the duplicate filter, you can pass the dont_filter=True keyword argument to scrapy.Request.