Scrapy crawler - Referer headers missing from response objects when Frontera middleware is enabled

Asked: 2015-09-01 15:14:46

Tags: python scrapy frontera

When I enable the following Frontera middlewares in Scrapy, every response object loses its Referer header.

Is there any way to keep the Referer?

When I remove the lines below, the Referer is available again, but I need these Frontera middlewares enabled:

SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
})


DOWNLOADER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
})

SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

Also, the RefererMiddleware is enabled; I can see it listed in the debug log when Scrapy starts.
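
For context, here is a minimal sketch of the kind of callback where the problem shows up; the spider name, URLs and selectors are assumptions, not the original code. With the scheduler middlewares enabled, response.request.headers.get('Referer') reportedly comes back empty; one possible workaround, assuming Frontera preserves request headers across its request conversion, is to set the Referer explicitly when yielding follow-up requests:

import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the issue;
    # name, URLs and selectors are assumptions, not the original code.
    name = 'referer_example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # With the Frontera scheduler middlewares enabled this value
        # is reportedly empty instead of holding the parent URL.
        self.logger.info('Referer: %s',
                         response.request.headers.get('Referer'))

        for href in response.css('a::attr(href)').extract():
            # Possible workaround: set the Referer explicitly so it does
            # not depend on RefererMiddleware (assumes Frontera keeps
            # request headers across its Scrapy<->Frontera conversion).
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse,
                headers={'Referer': response.url},
            )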



EDIT: this is the entire content of my settings file

BOT_NAME = 'crawler'

SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'

USER_AGENT = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36"

DOWNLOAD_DELAY = 2


DUPEFILTER=True


ITEM_PIPELINES = {
    'crawler.pipelines.AllDataPipeline': 300
}


SPIDER_MIDDLEWARES = {}

DOWNLOADER_MIDDLEWARES = {}


RETRY_ENABLED = True
RETRY_TIMES = 5
RETRY_HTTP_CODES = [500, 502, 503, 504, 400, 408]
REFERER_ENABLED = True

######################################################################
# Frontera Settings 
#######################################################################


BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontier.db'


HTTPCACHE_ENABLED = False
REDIRECT_ENABLED = True
COOKIES_ENABLED = False
DOWNLOAD_TIMEOUT = 20
RETRY_ENABLED = False

CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 2

LOGSTATS_INTERVAL = 10

SPIDER_MIDDLEWARES = {}
DOWNLOADER_MIDDLEWARES = {}





SPIDER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 699,
})


DOWNLOADER_MIDDLEWARES.update({
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
})

SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

1 Answer:

Answer 0 (score: -1)

The DUPEFILTER setting can't just be True. You can set

DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

If you want a request to skip the duplicate filter, you can pass the dont_filter=True kwarg to scrapy.Request.
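
As a hedged sketch (the spider, URL and settings comment below are placeholders, not code from the question or answer), this is how dont_filter=True would be passed, alongside a valid DUPEFILTER_CLASS value:

import scrapy

# settings.py: instead of DUPEFILTER = True, point DUPEFILTER_CLASS at an
# actual dupefilter class (RFPDupeFilter is Scrapy's default):
# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'


class DontFilterExample(scrapy.Spider):
    # Placeholder spider used only to show the keyword argument.
    name = 'dont_filter_example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # dont_filter=True bypasses the duplicate filter for this request,
        # so it is scheduled even if the URL has been seen before.
        yield scrapy.Request(
            'http://example.com/some-page',  # placeholder URL
            callback=self.parse,
            dont_filter=True,
        )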