Scrapy doesn't change the proxy

Date: 2016-02-23 13:16:57

Tags: python proxy scrapy

I've hit a problem while trying to test proxies with Scrapy. I want to check each proxy against httpbin.org, using this crawler:

class CheckerSpider(scrapy.Spider):
    name = "checker"
    start_urls = (
        'https://www.httpbin.org/ip',
    )
    connection = get_connection()

    def start_requests(self):

        with self.connection.cursor() as cursor:
            # Unix timestamp one hour ago: select proxies not checked recently
            limit = int((datetime.now() - datetime(1970, 1, 1)).total_seconds()) - 3600
            q = """ SELECT *
                    FROM {}
                    WHERE active = 1 AND last_checked <= {} OR last_checked IS NULL;""".format(DB_TABLE, limit)
            cursor.execute(q)
            proxy_list = cursor.fetchall()

        for proxy in proxy_list[:15]:
            word = get_random_word()
            req = scrapy.Request(self.start_urls[0], callback=self.check_proxy, dont_filter=True)
            req.meta['proxy'] = 'https://{}:8080'.format(proxy['ip'])
            req.meta['item'] = proxy
            # b64encode avoids the trailing newline that encodestring appends
            user_pass = base64.b64encode('{}:{}'.format(PROXY_USER, PROXY_PASSWORD))
            req.headers['Proxy-Authorization'] = 'Basic {}'.format(user_pass)
            req.headers['User-Agent'] = get_user_agent()
            yield req

    def check_proxy(self, response):
        print response.request.meta['proxy']
        print response.meta['item']['ip']
        print response.body
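For reference, the `limit` value in the query above is just the current Unix timestamp minus one hour, so the query selects proxies that haven't been checked in the last hour (or ever). A minimal sketch of that computation, with a hypothetical helper name:

```python
from datetime import datetime

def staleness_cutoff(hours=1):
    # Hypothetical helper mirroring the spider's `limit` computation:
    # seconds since the Unix epoch (per the local clock, as in the
    # original code), minus the staleness window.
    epoch = datetime(1970, 1, 1)
    return int((datetime.now() - epoch).total_seconds()) - hours * 3600
```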

But when I test it, I see that Scrapy connects to the URL through only five proxies and then stops switching. Sample output (just the IP messages):

2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.130:8080
192.168.100.130
{
  "origin": "192.168.100.130"
}

2016-02-23 14:54:36 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.131:8080
192.168.100.131
{
  "origin": "192.168.100.131"
}
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.132:8080
192.168.100.132
{
  "origin": "192.168.100.132"
}

# Here Scrapy used wrong proxy to connect to site.
2016-02-23 14:54:37 [scrapy] DEBUG: Crawled (200) <GET https://www.httpbin.org/ip> (referer: None)
https://192.168.100.134:8080
192.168.100.134
{
  "origin": "192.168.100.130"
}

Maybe I've made a mistake somewhere? Any ideas? Thanks.

UPD: Actually, I'm now using a middleware to attach the proxy to each request. This is where I placed it in the middleware order:

DOWNLOADER_MIDDLEWARES = {
    'checker.middlewares.ProxyCheckMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

But I get the same result. Here is my custom middleware that adds the proxy:

class ProxyCheckMiddleware(object):

    def process_request(self, request, spider):
        if 'proxy' not in request.meta:
            request.meta['proxy'] = 'https://{}:8080'.format(request.meta['item']['ip'])
            request.meta['handle_httpstatus_list'] = [302, 503]
            # b64encode avoids the trailing newline that encodestring appends
            user_pass = base64.b64encode('{}:{}'.format(PROXY_USER, PROXY_PASSWORD))
            request.headers['Proxy-Authorization'] = 'Basic {}'.format(user_pass)
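As a side note, Python 2's `base64.encodestring` appends a trailing newline, which ends up inside the `Proxy-Authorization` header and can make some proxies reject the credentials. A small sketch of building the header value with `b64encode` instead (the function name is illustrative):

```python
import base64

def proxy_auth_header(user, password):
    # b64encode emits no trailing newline, unlike the old encodestring,
    # so the resulting header value is a single clean token.
    creds = '{}:{}'.format(user, password).encode('ascii')
    return 'Basic ' + base64.b64encode(creds).decode('ascii')

# proxy_auth_header('user', 'pass') -> 'Basic dXNlcjpwYXNz'
```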

UPD: So far this looks like a bug in Scrapy. See the conversation here: https://github.com/scrapy/scrapy/issues/1807

3 Answers:

Answer 0 (score: 1)

At the end of the story, this was a bug in Scrapy, fixed in version 1.1.0 (see the conversation). Many thanks to redapple and rverbitsky for the help!

Answer 1 (score: 0)

Have you tried this?

Follow its setup instructions: you give it a text file containing your proxy list in the specified format and run your requests through it. It randomizes which proxy is used and discards proxies that fail after a certain number of attempts. I can strongly recommend it; I'm currently using it with the hidemyass.com proxy list.
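The rotate-and-retire behaviour that answer describes can be sketched roughly as follows (the class and method names are hypothetical, not that library's actual API):

```python
import random

class RandomProxyRotator(object):
    """Minimal sketch: pick a random live proxy per request and retire
    proxies that fail more than a given number of times."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}
        self.max_failures = max_failures

    def pick(self):
        # Only proxies below the failure threshold are candidates
        alive = [p for p, n in self.failures.items() if n < self.max_failures]
        return random.choice(alive) if alive else None

    def mark_failed(self, proxy):
        self.failures[proxy] = self.failures.get(proxy, 0) + 1
```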

Answer 2 (score: 0)

Try this ProxyMiddleware in your middlewares.py file:

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Assignment, not `==` (a comparison here would be a no-op)
        request.meta['proxy'] = 'https://{}:8080'.format(request.meta.get('item').get('ip'))

        # If the proxy needs auth (you will also need to import base64):
        # proxy_auth = "username:password"
        # encoded_auth = base64.b64encode(proxy_auth)
        # request.headers['Proxy-Authorization'] = 'Basic ' + encoded_auth

        # Return None so the request continues through the remaining
        # middlewares; returning the request object would make Scrapy
        # reschedule it endlessly.
        return None

And in your settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'checker.middlewares.ProxyMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110
}