middlewares.py in Scrapy not executing as expected when trying to use multiple user agents

Date: 2015-12-22 21:35:35

Tags: python web-scraping scrapy middleware

I am trying to use multiple user agents in my Scrapy project. I found this script for middlewares.py here:

import random
import logging

from myScrape.settings import USER_AGENT_LIST


class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        # pick a random user agent for each outgoing request
        ua = random.choice(USER_AGENT_LIST)
        print('ua = %s' % ua)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # check which ua is used
            logging.debug(u'\n>>>>> User-Agent: %s\n' % request.headers)

USER_AGENT_LIST is defined in settings.py:

USER_AGENT_LIST = [
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:43.0) Gecko/20100101 Firefox/43.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) '
    'Chrome/16.0.912.36 Safari/535.7',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 '
    '(KHTML, like Gecko) Version/5.1.3 Safari/534.53.10',
]

DOWNLOADER_MIDDLEWARES = {
    'myScrape.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None,
    # Disable compression middleware, so the actual HTML pages are cached
}

But it does not work as I expected. I still see the default Scrapy user agent while crawling. The print statement in middlewares.py is called and shows the correct ua, but the log output reports the Scrapy agent.

How does this work? Do I need to call it somehow from my spider script?

1 Answer:

Answer 0 (score: 0):

As eLRuLL pointed out, it was a typo. I had missed the 's' in downloadermiddlewares in the path for UserAgentMiddleware.
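For reference, a minimal sketch of the corrected settings entry, using the module paths from the question (myScrape is the project name assumed there):

```python
# Corrected DOWNLOADER_MIDDLEWARES for settings.py. The fix is the
# plural "downloadermiddlewares" in the built-in middleware's path.
# With the misspelled path, the None entry matched no real component,
# so Scrapy's default UserAgentMiddleware stayed enabled and kept
# setting its own User-Agent header.
DOWNLOADER_MIDDLEWARES = {
    'myScrape.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
```

Setting a middleware's value to None disables it, but only if the dotted path matches the component exactly, which is why the missing 's' failed silently rather than raising an error.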