How can I set different Scrapy settings for different spiders?

Asked: 2013-10-11 21:17:25

Tags: scrapy

I want to enable an HTTP proxy for some spiders and disable it for others.

Can I do something like this?

# settings.py
proxy_spiders = ['a1', 'b2']

if spider in proxy_spiders:  # how do I get the spider name here???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
         'myproject.middlewares.RandomUserAgentMiddleware': 400,
         'myproject.middlewares.ProxyMiddleware': 410,
         'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
else:
    DOWNLOADER_MIDDLEWARES = {
         'myproject.middlewares.RandomUserAgentMiddleware': 400,
         'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }

If the code above can't work, is there any other suggestion?

5 Answers:

Answer 0 (score: 33)

A bit late, but since the 1.0.0 release there is a new feature in Scrapy that lets you override settings per spider, like this:

class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        "HTTP_PROXY": 'http://127.0.0.1:8123',
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        },
    }




class MySpider2(scrapy.Spider):
    name = "my_spider2"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        },
    }

Answer 1 (score: 10)

There is a newer, easier way to do this.

class MySpider(scrapy.Spider):
    name = 'myspider'

    custom_settings = {
        'SOME_SETTING': 'some value',
    }

I'm using Scrapy 1.3.1.

Answer 2 (score: 8)

You can add settings.overrides in the spider.py file. A working example:

from scrapy.conf import settings

settings.overrides['DOWNLOAD_TIMEOUT'] = 300 

Something like this should also work for your case:

from scrapy.conf import settings

settings.overrides['DOWNLOADER_MIDDLEWARES'] = {
     'myproject.middlewares.RandomUserAgentMiddleware': 400,
     'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}

Answer 3 (score: 3)

You can define your own proxy middleware, something simple like this:

from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

class ConditionalProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        if getattr(spider, 'use_proxy', None):
            return super(ConditionalProxyMiddleware, self).process_request(request, spider)

Then, in the spiders where you want the proxy enabled, define the attribute use_proxy = True. Don't forget to disable the default proxy middleware and enable your modified one.
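
The gating logic can be exercised without Scrapy. Below is a sketch with stand-in classes (FakeSpider and the dict-based request are illustrative placeholders, not Scrapy objects): the middleware only applies the proxy when the spider opts in via a use_proxy attribute.

```python
# Stand-ins to illustrate the conditional check: the proxy is applied
# only for spiders that set use_proxy = True.

class FakeSpider:
    """Placeholder for a Scrapy spider; only the use_proxy attribute matters."""
    pass

class ConditionalProxy:
    def process_request(self, request, spider):
        # Mirrors the getattr(spider, 'use_proxy', None) check above.
        if getattr(spider, "use_proxy", None):
            # Stand-in for delegating to the real HttpProxyMiddleware.
            request["proxy"] = "http://127.0.0.1:8123"
        return None

proxied = FakeSpider()
proxied.use_proxy = True
plain = FakeSpider()

mw = ConditionalProxy()
req_with_proxy, req_without_proxy = {}, {}
mw.process_request(req_with_proxy, proxied)
mw.process_request(req_without_proxy, plain)
```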

Answer 4 (score: -2)

Why not use two projects instead of one?

Let's name these two projects proj1 and proj2. In proj1's settings.py, put these settings:

HTTP_PROXY = 'http://127.0.0.1:8123'
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}

In proj2's settings.py, put these settings:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}