Tricky Selenium issue when using SeleniumRequest

Time: 2019-05-21 21:23:32

Tags: python python-3.x selenium web-scraping scrapy

I've written a very small script that uses the scrapy-selenium library to combine Scrapy with Selenium and parse the names of different restaurants from a webpage.

My settings.py file contains:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which("chromedriver")

My spider contains the following (note the middleware reference used within the CrawlerProcess):

import scrapy
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess
from scrapy_selenium import SeleniumRequest

class YPageSpider(scrapy.Spider):
    name = "yellowpages"
    link = 'https://www.yellowpages.com/search?search_terms=Pizza+Hut&geo_location_terms=San+Francisco%2C+CA'

    def start_requests(self):
        yield SeleniumRequest(url=self.link, callback=self.parse)

    def parse(self, response):
        for elem in response.css(".v-card .info a.business-name::attr(href)").getall():
            yield {"links": elem}

if __name__ == '__main__':
    settings = get_project_settings()
    settings['DOWNLOADER_MIDDLEWARES'] = {'scrapy_selenium.SeleniumMiddleware': 800}
    c = CrawlerProcess(settings)
    c.crawl(YPageSpider)
    c.start()

However, when I run the script and look through the list of downloader middlewares in the log, I can see that the scrapy_selenium.SeleniumMiddleware reference is never activated.

How can I make it run successfully?

The traceback I get:

2019-05-22 18:03:57 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: proxyspider)
2019-05-22 18:03:57 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Windows-7-6.1.7601-SP1
2019-05-22 18:03:57 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'proxyspider', 'NEWSPIDER_MODULE': 'proxyspider.spiders', 'SPIDER_MODULES': ['proxyspider.spiders']}
2019-05-22 18:03:57 [scrapy.extensions.telnet] INFO: Telnet Password: f7cd144cc88f20f6
2019-05-22 18:03:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
Unhandled error in Deferred:
2019-05-22 18:03:57 [twisted] CRITICAL: Unhandled error in Deferred:

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\crawler.py", line 172, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\crawler.py", line 176, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\defer.py", line 1613, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\defer.py", line 1529, in _cancellableInlineCallbacks
    _inlineCallbacks(None, g, status)
--- <exception caught here> ---
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\middleware.py", line 53, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\middleware.py", line 35, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy\utils\misc.py", line 140, in create_instance
    return objcls.from_crawler(crawler, *args, **kwargs)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy_selenium\middlewares.py", line 71, in from_crawler
    browser_executable_path=browser_executable_path
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy_selenium\middlewares.py", line 43, in __init__
    for argument in driver_arguments:
builtins.TypeError: 'NoneType' object is not iterable

Full traceback available here

1 Answer:

Answer 0 (score: 2):

File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\scrapy_selenium\middlewares.py", line 43, in __init__
    for argument in driver_arguments:
builtins.TypeError: 'NoneType' object is not iterable

According to line 43 of the scrapy_selenium source on GitHub, the middleware tries to iterate over the SELENIUM_DRIVER_ARGUMENTS setting, which the Selenium middleware requires and which is missing from your code.
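
A minimal sketch of the fix, assuming the Chrome setup from the question: define SELENIUM_DRIVER_ARGUMENTS in settings.py so the middleware has a list to iterate over (an empty list should work if you don't need any browser flags; '--headless' is a common choice):

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which("chromedriver")
# The middleware iterates over this list at startup, so it must be defined;
# pass Chrome flags here, or an empty list if none are needed.
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

Alternatively, since the script already overrides settings in the __main__ block, the same key can be set there alongside the middleware:

    settings = get_project_settings()
    settings['DOWNLOADER_MIDDLEWARES'] = {'scrapy_selenium.SeleniumMiddleware': 800}
    settings['SELENIUM_DRIVER_ARGUMENTS'] = ['--headless']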