I want to enable HTTP proxies for some spiders and disable them for others.
Can I do something like this?
# settings.py
proxy_spiders = ['a1', 'b2']
if spider in proxy_spiders:  # how do I get the spider name here???
    HTTP_PROXY = 'http://127.0.0.1:8123'
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'myproject.middlewares.ProxyMiddleware': 410,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
else:
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.RandomUserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
    }
If the code above can't work, are there any other suggestions?
Answer 0 (score: 33)
A bit late, but since the 1.0.0 release there is a new feature in Scrapy that lets you override settings per spider, like this:
class MySpider(scrapy.Spider):
    name = "my_spider"
    custom_settings = {
        "HTTP_PROXY": 'http://127.0.0.1:8123',
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'myproject.middlewares.ProxyMiddleware': 410,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
        }
    }

class MySpider2(scrapy.Spider):
    name = "my_spider2"
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            'myproject.middlewares.RandomUserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
        }
    }
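The key point is that each spider's custom_settings dict is layered on top of the project-wide settings, so the proxy configuration only applies to the spider that declares it. Here is a minimal plain-Python sketch of that merge; it is a simplified model for illustration, not Scrapy's actual Settings implementation (project_settings and effective_settings are made-up names):

```python
# Project-wide defaults, as they might appear in settings.py.
project_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "myproject.middlewares.RandomUserAgentMiddleware": 400,
    },
}

class MySpider:
    name = "my_spider"
    custom_settings = {
        "HTTP_PROXY": "http://127.0.0.1:8123",
        "DOWNLOADER_MIDDLEWARES": {
            "myproject.middlewares.RandomUserAgentMiddleware": 400,
            "myproject.middlewares.ProxyMiddleware": 410,
        },
    }

class MySpider2:
    name = "my_spider2"
    custom_settings = {}  # no overrides: this spider uses the project settings unchanged

def effective_settings(spider_cls, base):
    """Project settings updated with the spider's overrides (the spider wins)."""
    merged = dict(base)
    merged.update(getattr(spider_cls, "custom_settings", None) or {})
    return merged

s1 = effective_settings(MySpider, project_settings)
s2 = effective_settings(MySpider2, project_settings)
print("HTTP_PROXY" in s1)  # True  - the proxy spider gets the proxy setting
print("HTTP_PROXY" in s2)  # False - the other spider does not
```

Because the override lives on the spider class, no branching in settings.py is needed at all.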
Answer 1 (score: 10)
There is a newer and simpler way to do this.
class MySpider(scrapy.Spider):
    name = 'myspider'
    custom_settings = {
        'SOME_SETTING': 'some value',
    }
I'm using Scrapy 1.3.1.
Answer 2 (score: 8)
You can add settings.overrides in the spider.py file. A working example:
from scrapy.conf import settings
settings.overrides['DOWNLOAD_TIMEOUT'] = 300
For your case, something like this should also work:
from scrapy.conf import settings
settings.overrides['DOWNLOADER_MIDDLEWARES'] = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
Answer 3 (score: 3)
You can define your own proxy middleware; something simple like this:
from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

class ConditionalProxyMiddleware(HttpProxyMiddleware):
    def process_request(self, request, spider):
        # Only delegate to the stock proxy middleware when the spider opts in.
        if getattr(spider, 'use_proxy', None):
            return super(ConditionalProxyMiddleware, self).process_request(request, spider)
Then define the attribute use_proxy = True in the spiders where you want the proxy enabled. Don't forget to disable the default proxy middleware and enable your modified one.
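The opt-in logic is easy to check in isolation. Below is a minimal sketch using stand-in classes instead of Scrapy's own (FakeRequest and FakeProxyMiddleware are invented here; the real HttpProxyMiddleware does more than set request.meta['proxy'], so treat this only as a model of the use_proxy check):

```python
# Stand-ins for Scrapy objects, just to exercise the use_proxy check.
class FakeRequest:
    def __init__(self, url):
        self.url = url
        self.meta = {}

class FakeProxyMiddleware:
    """Plays the role of HttpProxyMiddleware: unconditionally sets a proxy."""
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8123'

class ConditionalProxyMiddleware(FakeProxyMiddleware):
    """Applies the proxy only when the spider opts in via use_proxy."""
    def process_request(self, request, spider):
        if getattr(spider, 'use_proxy', None):
            return super(ConditionalProxyMiddleware, self).process_request(request, spider)

class ProxiedSpider:
    use_proxy = True

class PlainSpider:
    pass

mw = ConditionalProxyMiddleware()

r1 = FakeRequest('http://example.com')
mw.process_request(r1, ProxiedSpider())
print(r1.meta)  # {'proxy': 'http://127.0.0.1:8123'}

r2 = FakeRequest('http://example.com')
mw.process_request(r2, PlainSpider())
print(r2.meta)  # {} - no proxy for spiders that don't opt in
```

The same pattern generalizes: any per-spider attribute read with getattr in process_request lets one middleware behave differently per spider.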
Answer 4 (score: -2)
Why not use two projects instead of one?
Let's name the two projects proj1 and proj2. In proj1's settings.py, put these settings:
HTTP_PROXY = 'http://127.0.0.1:8123'
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}
In proj2's settings.py, set these settings:
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
}