Scrapy: how to write a UserAgentMiddleware?

Date: 2017-09-02 17:24:18

Tags: scrapy

I want to write a UserAgentMiddleware for Scrapy. The docs say:

  Middleware that allows spiders to override the default user agent. In order for a spider to override the default user agent, its user_agent attribute must be set.

Docs: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.useragent

But there is no example, and I do not know how to write one. Any suggestions?

2 answers:

Answer 0 (score: 3)

You can look at it in your Scrapy installation path:

/Users/tarun.lalwani/.virtualenvs/project/lib/python3.6/site-packages/scrapy/downloadermiddlewares/useragent.py

"""Set User-Agent header per spider or use a default from settings"""

from scrapy import signals


class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
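Note the setdefault call in process_request: the middleware only fills in the User-Agent header when the request does not already carry one, so a per-request header always wins. That logic can be sketched with a plain dict, standalone with no Scrapy required (apply_user_agent is a hypothetical helper, not part of Scrapy):

```python
def apply_user_agent(headers, user_agent):
    # Mirrors the middleware's process_request: only set the
    # User-Agent header if the request does not already have one.
    if user_agent:
        headers.setdefault(b'User-Agent', user_agent)
    return headers

# A bare request picks up the middleware's user agent...
print(apply_user_agent({}, 'Scrapy'))  # {b'User-Agent': 'Scrapy'}
# ...but an explicit per-request header is left untouched.
print(apply_user_agent({b'User-Agent': 'Custom'}, 'Scrapy'))
```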

Below you can see an example of setting a random user agent:

https://github.com/alecxe/scrapy-fake-useragent/blob/master/scrapy_fake_useragent/middleware.py
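Whichever variant you write, a custom middleware also has to be enabled in settings.py, and it makes sense to disable the built-in one so the two do not compete. A sketch, where the module path and class name are hypothetical placeholders for your own code:

```python
# settings.py (sketch; myproject.middlewares.RandomUserAgentMiddleware
# is a hypothetical path standing in for your own middleware class)
DOWNLOADER_MIDDLEWARES = {
    # Setting a middleware to None disables it; here we turn off the
    # built-in UserAgentMiddleware in favour of the custom one.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # 400 is the priority slot the built-in middleware normally uses.
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}
```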

Answer 1 (score: 0)

First visit a few sites and grab some recent user agents. Then do the following in your standard middleware (this is the same place where you would set up your own proxy handling): pull a random UA from a text file and put it in the headers. The example below is a quick sketch; it re-reads useragents.txt on every request, so for a real crawl you would want to load the user agents into a list once at the top of the module instead.

import random

from scrapy import signals


class GdataDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Pick a random user agent; strip the trailing newline so it
        # does not end up inside the header value.
        with open('useragents.txt', 'r') as f:
            user_agents = f.readlines()
        user_agent = random.choice(user_agents).strip()
        request.headers.setdefault(b'User-Agent', user_agent)

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
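The "load them into a list once" improvement suggested above can be sketched in plain Python. The inline lines stand in for the contents of useragents.txt (their actual values are up to you):

```python
import random

# Stand-in for the contents of useragents.txt, one UA per line; in the
# middleware you would read these from the file once (at import time or
# in __init__) instead of re-opening the file on every request.
lines = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/55.0\n',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) Safari/604.1\n',
]

# Strip trailing newlines up front so they never leak into a header value.
USER_AGENTS = [line.strip() for line in lines]

user_agent = random.choice(USER_AGENTS)
print(user_agent)
```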