How do I change the User-Agent in a Scrapy spider?

Date: 2015-10-30 20:56:13

Tags: python scrapy tor

I wrote a spider that gets my IP from http://ip.42.pl/raw through a proxy. It is my first spider. I want to change the user agent, and I followed this tutorial: http://blog.privatenode.in/torifying-scrapy-project-on-ubuntu

I completed all the steps from the tutorial, and here is my code.

settings.py

BOT_NAME = 'CheckIP'

SPIDER_MODULES = ['CheckIP.spiders']
NEWSPIDER_MODULE = 'CheckIP.spiders'

USER_AGENT_LIST = ['Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3',
'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
'Mozilla/5.0 (Linux; U; Android 4.0.3; de-ch; HTC Sensation Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
'Mozilla/5.0 (Linux; U; Android 2.3; en-us) AppleWebKit/999+ (KHTML, like Gecko) Safari/999.9',
'Mozilla/5.0 (Linux; U; Android 2.3.5; zh-cn; HTC_IncredibleS_S710e Build/GRJ90) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'
    ]

HTTP_PROXY = 'http://127.0.0.1:8123'

DOWNLOADER_MIDDLEWARES = {
    'CheckIP.middlewares.RandomUserAgentMiddleware': 400,
    'CheckIP.middlewares.ProxyMiddleware': 410,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}

middleware.py

import random
from scrapy.conf import settings
from scrapy import log


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)
            #this is just to check which user agent is being used for request
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')

checkip.py

import time
from scrapy.spider import Spider
from scrapy.http import Request

class CheckIpSpider(Spider):
    name = 'checkip'
    allowed_domains = ["ip.42.pl"]
    url = "http://ip.42.pl/raw"

    def start_requests(self):
        yield Request(self.url, callback=self.parse)

    def parse(self, response):
        now = time.strftime("%c")
        ip = now + "-" + response.body + "\n"
        with open('ips.txt', 'a') as f:
            f.write(ip)

This is what is returned for the User-Agent:
2015-10-30 22:24:20+0200 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-10-30 22:24:20+0200 [checkip] DEBUG: User-Agent: Scrapy/0.24.4 (+http://scrapy.org) <GET http://ip.42.pl/raw>

The user agent is still the default, Scrapy/0.24.4 (+http://scrapy.org).

When I manually add the header to the request, everything works:

    def start_requests(self):
        yield Request(self.url, callback=self.parse, headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})

This is what is returned in the console:
2015-10-30 22:50:32+0200 [checkip] DEBUG: User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3 <GET http://ip.42.pl/raw>

How can I use USER_AGENT_LIST in my spider?

2 Answers:

Answer 0 (score: 6):

If you don't need a random user agent, you can just set USER_AGENT in your settings file, like:

settings.py:

...
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
...

No need for the middleware in that case. But if you really want to select a user agent at random, first make sure from the Scrapy logs that RandomUserAgentMiddleware is being used; you should see something like this in your logs:

Enabled downloader middlewares:
[
    ...
    'CheckIP.middlewares.RandomUserAgentMiddleware',
    ...
]

Check that CheckIP.middlewares really is the path to that middleware (note that your file is named middleware.py, while the setting references CheckIP.middlewares).

It may also be that the settings are being loaded incorrectly in the middleware; I would recommend using the from_crawler method to load them:

class RandomUserAgentMiddleware(object):
    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        # Build the middleware with the settings of the running crawler.
        return cls(crawler.settings)

Now use self.settings.get('USER_AGENT_LIST') inside the process_request method to get what you want.
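
Put together, a minimal sketch of the whole middleware could look like this (assuming the USER_AGENT_LIST setting shown in the question's settings.py):

import random


class RandomUserAgentMiddleware(object):
    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        # Use the crawler's settings instead of the scrapy.conf singleton.
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Pick a random user agent for every outgoing request.
        ua = random.choice(self.settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)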

Also, please update your Scrapy version; it looks like you are using 0.24, while Scrapy is already past 1.0.

Answer 1 (score: 2):

In Scrapy 1.0.5, you can set the user agent per spider by defining a 'user_agent' attribute on the spider, or share one user agent across all spiders with the USER_AGENT setting. UserAgentMiddleware takes the user agent from the USER_AGENT setting and overrides it in the request headers if a user_agent attribute exists on the spider.
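
For example (a minimal sketch; the user-agent string is only an illustration):

from scrapy import Spider


class CheckIpSpider(Spider):
    name = 'checkip'
    # Overrides the global USER_AGENT setting for this spider only.
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'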

You can also write your own UserAgentMiddleware that assigns a random user agent to the request headers, and give it a priority lower than 400 so it runs before the built-in UserAgentMiddleware.
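
For example, in settings.py (a sketch; the path assumes the middleware lives in CheckIP/middlewares.py):

DOWNLOADER_MIDDLEWARES = {
    # Below 400, so this runs before the built-in UserAgentMiddleware,
    # which only fills in the User-Agent header if it is not already set.
    'CheckIP.middlewares.RandomUserAgentMiddleware': 390,
}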