Scrapy and proxies

Date: 2011-01-17 06:17:02

Tags: python scrapy

How do I use proxy support with the Python web-scraping framework Scrapy?

9 Answers:

Answer 0 (score: 43)

From the Scrapy FAQ:

Does Scrapy work with HTTP proxies?

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.

The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.

C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port

If you want to use an HTTPS proxy to visit HTTPS URLs, set the environment variable https_proxy as follows:

C:\>set https_proxy=https://proxy:port
csh% setenv https_proxy https://proxy:port
sh$ export https_proxy=https://proxy:port
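
If you prefer to keep the proxy configuration in code rather than in the shell, a minimal sketch (not from the original answer) is to set the same environment variables before starting the crawl programmatically; the proxy address and spider name below are placeholders:

import os
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Placeholder proxy address; HttpProxyMiddleware reads these variables at startup.
os.environ["http_proxy"] = "http://proxy:port"
os.environ["https_proxy"] = "https://proxy:port"

process = CrawlerProcess(get_project_settings())
process.crawl("my_spider")  # hypothetical spider name from your project
process.start()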

Answer 1 (score: 41)

Single proxy

  1. Enable HttpProxyMiddleware in settings.py, like this:

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
    }
    
  2. Pass the proxy to the request via request.meta:

    request = Request(url="http://example.com")
    request.meta['proxy'] = "host:port"
    yield request
    
  3. You can also pick a proxy address at random if you have an address pool, like this (a standalone start_requests sketch follows after this list):

    Multiple proxies

    import random
    from scrapy.http import Request
    from scrapy.spider import BaseSpider  # scrapy.Spider in newer Scrapy versions

    class MySpider(BaseSpider):
        name = "my_spider"

        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.proxy_pool = ['proxy_address1', 'proxy_address2', ..., 'proxy_addressN']

        def parse(self, response):
            # ... parse code ...
            if something:
                yield self.get_request(url)

        def get_request(self, url):
            req = Request(url=url)
            if self.proxy_pool:
                # Assign a random proxy from the pool to this request
                req.meta['proxy'] = random.choice(self.proxy_pool)
            return req
    
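A related pattern, sketched here under the assumption of placeholder proxy addresses and an example start URL (none of which come from the original answer), is to pick the proxy in start_requests so every initial request already carries one:

import random
from scrapy import Spider, Request

class ProxyPoolSpider(Spider):
    # Hypothetical spider showing a random proxy per initial request;
    # proxy addresses and the start URL are placeholders.
    name = "proxy_pool_spider"
    start_urls = ["http://example.com"]
    proxy_pool = ["http://host1:8080", "http://host2:8080"]

    def start_requests(self):
        for url in self.start_urls:
            # Pick a proxy at random for each request.
            yield Request(url, meta={"proxy": random.choice(self.proxy_pool)})

    def parse(self, response):
        self.logger.info("Fetched %s via %s", response.url, response.meta.get("proxy"))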

Answer 2 (score: 25)

1 - Create a new file called "middlewares.py", save it in your scrapy project, and add the following code to it.

import base64

class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy
        # (b64encode avoids the trailing newline that base64.encodestring added)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2 - Open your project's configuration file (./project_name/settings.py) and add the following code:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

Now your requests should be passed through this proxy. Simple, isn't it?
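
A quick way to check that the proxy is actually in use, sketched here with the public httpbin.org service (not part of the original answer), is a throwaway spider that logs the egress IP:

from scrapy import Spider

class ProxyCheckSpider(Spider):
    # Hypothetical spider: with the middleware above enabled, the body of
    # https://httpbin.org/ip should report the proxy's IP, not your own.
    name = "proxy_check"
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        self.logger.info("Egress IP response: %s", response.text)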

Answer 3 (score: 9)

That would be:

export http_proxy=http://user:password@proxy:port

Answer 4 (score: 4)

Someone [1] wrote a nice middleware for this: https://github.com/aivarsk/scrapy-proxies ("Scrapy proxy middleware").

Answer 5 (score: 3)

On Windows, I put together a couple of the previous answers and it worked. I simply did:

C:\>  set http_proxy=http://username:password@proxy:port

Then I launched my program:

C:/.../RightFolder> scrapy crawl dmoz

where "dmoz" is the spider name (I mention it because it is the one you find in the tutorial on the internet, and if you are here you have probably started from that tutorial).

Answer 6 (score: 2)

Since I had trouble setting the environment in /etc/environment, this is what I added to my spider (Python):

os.environ["http_proxy"] = "http://localhost:12345"
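
A minimal sketch of where this line could live, assuming the default HttpProxyMiddleware is enabled (spider name, start URL, and the local proxy address are placeholders, not from the original answer); the assignment sits at module level so the variable is in place before Scrapy reads the proxy environment:

import os
from scrapy import Spider

# Set before the crawl starts so HttpProxyMiddleware picks it up
# when it reads the proxy environment variables (placeholder local proxy).
os.environ["http_proxy"] = "http://localhost:12345"

class EnvProxySpider(Spider):
    # Hypothetical spider name and start URL for illustration.
    name = "env_proxy_spider"
    start_urls = ["http://example.com"]

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)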

Answer 7 (score: 2)

Here is what I do.

Method 1:

Create a downloader middleware like this:

class ProxiesDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # Placeholder proxy URL; include the scheme, and credentials if needed
        request.meta['proxy'] = 'http://user:pass@host:port'

and enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'my_scrapy_project_directory.middlewares.ProxiesDownloaderMiddleware': 600,
}

That's it; the proxy will now be applied to every request.

Method 2:

Just enable HttpProxyMiddleware in settings.py and then do this for every request:

yield Request(url=..., meta={'proxy': 'http://user:pass@host:port'})
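
For context, a minimal sketch of Method 2 inside a spider (spider name, start URL, and proxy address are placeholders, not from the original answer):

from scrapy import Spider, Request

class PerRequestProxySpider(Spider):
    # Relies on HttpProxyMiddleware, which is enabled by default.
    name = "per_request_proxy"

    def start_requests(self):
        # Placeholder URL and proxy; each request carries its own proxy setting.
        yield Request(url="http://example.com",
                      meta={'proxy': 'http://user:pass@host:port'})

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)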

Answer 8 (score: 0)

I suggest using a middleware such as scrapy-proxies. You can rotate proxies for all requests, filter out bad proxies, or use a single proxy. Also, using a middleware saves you the trouble of setting up a proxy on every run.

This comes straight from the GitHub README.

  • Install the scrapy-proxies library

    pip install scrapy_proxies

  • Add the following settings to your settings.py

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

# If proxy mode is 2 uncomment this sentence :
#CUSTOM_PROXY = "http://host1:port"

Here you can change the number of retries and set a single or rotating proxy.

  • Then add your proxies to a list.txt file like this:
http://host1:port
http://username:password@host2:port
http://host3:port

After this, all of your requests for that project will be sent through the proxy. The proxy is rotated randomly for every request, and this does not affect concurrency.

Note: if you don't want to use a proxy, just comment out the scrapy_proxies middleware line.

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
#    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

Happy crawling!