Scrapy - using middlewares.py correctly

Time: 2015-10-12 12:57:50

Tags: python-2.7 scrapy scrapy-spider

Could someone please help me find what I am missing and how to fix the exceptions.ImportError: No module named middlewares that my code is throwing?

My folder structure is:

[screenshot of the project folder structure]

Here is the DOWNLOADER_MIDDLEWARES section of settings.py:

DOWNLOADER_MIDDLEWARES = {
    'IpRotation.middleware.DmozSpider': 543,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'IpRotation.ProxyMiddleware.ProxyMiddleware': 800,
    'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware': 350
}
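
Each key in DOWNLOADER_MIDDLEWARES is a dotted import path that Scrapy resolves at startup (via scrapy.utils.misc.load_object), so every part of the path must be importable from the project root. As a quick sanity check, the path can be loaded by hand; this is only a sketch that reuses one of the paths from the settings above and assumes it is run from the directory containing scrapy.cfg:

# Run from the project root (the directory that holds scrapy.cfg).
from scrapy.utils.misc import load_object

# Raises ImportError if the module cannot be imported, or NameError if the
# module exists but defines no class with that name; this is the same failure
# Scrapy hits while building the downloader middleware chain.
cls = load_object('IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware')
print(cls)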

My spider:

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

My custom UserAgentMiddleware.py:

import logging
import random
import scrapy
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        user_agent_list = [....]
        ua = random.choice(user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)   
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request))

My custom IPRotationMiddleWare.py:

import random
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

class ProxyMiddleware(HttpProxyMiddleware):
    def __init__(self, proxy_ip=''):
        self.proxy_ip = proxy_ip

    def process_request(self, request, spider):
        ip = random.choice(self.proxy_list)
        if ip:
            request.meta['proxy'] = ip

    proxy_list = [.......]
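
One thing worth noting about the proxy entries (a minimal sketch; the addresses below are placeholders, not values from the post): Scrapy and the underlying Twisted connector expect request.meta['proxy'] to be a complete URL with scheme, host and port, otherwise Twisted cannot extract a host to connect to.

import random

class ProxyMiddleware(object):
    # Placeholder proxies for illustration only; real addresses go here,
    # each written as scheme://host:port.
    proxy_list = [
        'http://203.0.113.10:8080',
        'http://203.0.113.11:3128',
    ]

    def process_request(self, request, spider):
        # Attach a randomly chosen, fully qualified proxy URL to the request.
        request.meta['proxy'] = random.choice(self.proxy_list)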

I cannot figure out what is causing the "No module named middlewares" exception. Also, what is the difference between a spider middleware and a downloader middleware?

The error that is thrown:

2015-10-12 18:29:34 [scrapy] ERROR: Error downloading <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\endpoints.py", line 542, in connect
    timeout=self._timeout, bindAddress=self._bindAddress)
  File "C:\Python27\lib\site-packages\twisted\internet\posixbase.py", line 482, in connectTCP
    c = tcp.Connector(host, port, factory, timeout, bindAddress, self)
  File "C:\Python27\lib\site-packages\twisted\internet\tcp.py", line 1165, in __init__
    if abstract.isIPv6Address(host):
  File "C:\Python27\lib\site-packages\twisted\internet\abstract.py", line 522, in isIPv6Address
    if '%' in addr:
TypeError: argument of type 'NoneType' is not iterable

0 answers:

No answers yet.