Python - Unable to rotate the user agent dynamically in Scrapy

Time: 2015-10-09 11:07:35

Tags: python scrapy user-agent

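I am overriding Scrapy's default HttpProxyMiddleware and UserAgentMiddleware with my own implementations, which rotate the user agent and the IP address, each picked at random from a supplied list of values. The IP changes on every request, but the user agent does not. I cannot figure out why.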

Here are the implementations of my classes:

RotateUserAgentMiddleware

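import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # user_agent_list is not shown in the question; it is assumed to be
        # set elsewhere on the class as a list of user-agent strings.
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # Log the User-Agent header actually attached to the request.
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request)
            )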

ProxyMiddleware

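import random

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware


class ProxyMiddleware(HttpProxyMiddleware):
    def __init__(self, proxy_ip=''):
        self.proxy_ip = proxy_ip

    def process_request(self, request, spider):
        # proxy_list is not shown in the question; it is assumed to be
        # set elsewhere on the class as a list of proxy addresses.
        ip = random.choice(self.proxy_list)
        if ip:
            request.meta['proxy'] = ip
            print(request.meta)
        return request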

Changes made to DOWNLOADER_MIDDLEWARES in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'IpRotation.ProxyMiddleware.ProxyMiddleware': 800,
    'scrapy.downloadermiddleware.useragent.UserAgentMiddleware' : None,
    'IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware':790
}

The IP and user-agent values printed on my console for each request:

2015-10-09 15:51:46 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:46 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '195.*.*.*:3120'}
2015-10-09 15:51:46 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '195.*.*.*:3120'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '200.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '213.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '200.*.*.*:80'}
2015-10-09 15:51:48 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '58.*.*.*:80'}

I have not changed USER_AGENT in settings.py, because I need to assign the value randomly:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'IPProxy (+http://www.yourdomain.com)'

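The one part of the project I am unclear about is the values assigned in DOWNLOADER_MIDDLEWARES. None tells Scrapy to ignore the class, but what do the integers mean? Could someone please help me?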
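
For readers with the same question about the integers: as far as I understand, they are ordering values. Scrapy merges DOWNLOADER_MIDDLEWARES with its built-in DOWNLOADER_MIDDLEWARES_BASE, drops the entries set to None, and calls each remaining middleware's process_request in ascending order. The following is a simplified model of that merge, not Scrapy's actual code; `effective_order` is a hypothetical helper, and the two BASE values (400 and 750) are assumed from the Scrapy 1.0 defaults.

```python
# Simplified, hypothetical model of how the middleware order is derived.
# BASE lists only two built-in entries; the numbers are assumed from Scrapy 1.0.
BASE = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

# The project setting from the question, copied verbatim.
CUSTOM = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'IpRotation.ProxyMiddleware.ProxyMiddleware': 800,
    'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware': 790,
}


def effective_order(base, custom):
    """Merge both dicts, drop None entries, sort by the integer value."""
    merged = dict(base)
    merged.update(custom)
    enabled = [(order, path) for path, order in merged.items() if order is not None]
    return [path for order, path in sorted(enabled)]


for path in effective_order(BASE, CUSTOM):
    print(path)
```

In this model the built-in UserAgentMiddleware is still listed first. The key meant to disable it is spelled `scrapy.downloadermiddleware.useragent...`, without the trailing "s" used on the HttpProxyMiddleware line, so it does not match the built-in entry and the None has no effect on it.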

1 Answer:

Answer 0 (score: 1)

In DOWNLOADER_MIDDLEWARES, change the value of 'IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware' to a number less than 400.
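
A possible explanation for why a value below 400 helps, sketched under assumptions: in Scrapy 1.0 the built-in UserAgentMiddleware appears to run at 400 and fills the header with setdefault, so a custom middleware that also uses setdefault at 790 finds the header already set and leaves it alone. The classes below are hypothetical stand-ins that mimic only this behaviour; they do not import Scrapy, and `FakeRequest`, `USER_AGENTS`, and `run_chain` are invented for illustration.

```python
import random

# Made-up user-agent strings, for illustration only.
USER_AGENTS = ['ExampleBrowser/1.0', 'ExampleBrowser/2.0']


class FakeRequest(object):
    """Minimal stand-in for a Scrapy request: only a headers dict."""
    def __init__(self):
        self.headers = {}


class BuiltinLikeUserAgent(object):
    """Mimics the built-in behaviour: set the header only if it is absent."""
    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', 'Scrapy/1.0.3 (+http://scrapy.org)')


class RotatingSetDefault(object):
    """Same approach as the question: setdefault keeps an existing value."""
    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', random.choice(USER_AGENTS))


class RotatingOverwrite(object):
    """Alternative: plain assignment replaces whatever was set earlier."""
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        # Falls through and returns None, so processing would continue.


def run_chain(ordered):
    """Call process_request in ascending order of the integer value."""
    request = FakeRequest()
    for _, middleware in sorted(ordered, key=lambda pair: pair[0]):
        middleware.process_request(request, spider=None)
    return request.headers['User-Agent']


late = run_chain([(400, BuiltinLikeUserAgent()), (790, RotatingSetDefault())])
early = run_chain([(400, BuiltinLikeUserAgent()), (390, RotatingSetDefault())])
forced = run_chain([(400, BuiltinLikeUserAgent()), (790, RotatingOverwrite())])

print(late)                   # the default Scrapy user agent survives at 790
print(early in USER_AGENTS)   # running before 400 lets the random value win
print(forced in USER_AGENTS)  # direct assignment wins regardless of order
```

Two related points may be worth checking in the original project. Assigning `request.headers['User-Agent']` directly, as in `RotatingOverwrite`, would make the order less critical. Separately, the question's ProxyMiddleware returns the request from process_request; according to the Scrapy documentation, returning a Request object reschedules it instead of continuing the chain, which may be why the same URL repeats in the log, so returning None is the usual choice.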