Improving Scrapy and Selenium with headless Firefox

Asked: 2018-01-25 06:13:16

Tags: python selenium scrapy firefox-headless

I am scraping a JavaScript-heavy website and have set up a Vagrant instance (1GB RAM) to check feasibility. The system crashes after parsing a few URLs. I have not been able to determine the memory requirements of this setup or the cause of the crash. However, I had htop running in parallel and took a screenshot before the system crashed, attached below. I suspect the memory is insufficient, but I don't know how much I actually need. I am therefore looking for:

  1. The memory requirements of my setup (Scrapy + Selenium + Firefox headless)
  2. The cause of the crash
  3. How to improve the scraping process
  4. Alternatives (to Scrapy, Selenium, Firefox)
  5. My SeleniumMiddleware:

    import os, traceback
    from shutilwhich import which
    from scrapy import signals
    from scrapy.http import HtmlResponse
    from scrapy.utils.project import get_project_settings
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    from selenium.webdriver.firefox.options import Options
    
    SELENIUM_HEADLESS = False
    
    settings = get_project_settings()
    
    class SeleniumMiddleware(object):
        driver = None
    
        @classmethod
        def from_crawler(cls, crawler):
            middleware = cls()
            crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
            crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
            return middleware
    
        def process_request(self, request, spider):
            if not request.meta.get('selenium'):
                return
            self.driver.get(request.url)
    
            #if setting new cookies remove old
            if len(request.cookies):
                self.driver.implicitly_wait(1)
                self.driver.delete_all_cookies()
    
            #add only desired cookies if session persistence is requested
            request_cookies = []

            if request.meta.get('request_cookies'):
                request_cookies = request.meta.get('request_cookies')

            for cookie in request.cookies:
                if cookie['name'] in request_cookies:
                    print ' ---- set request cookie [%s] ---- ' % cookie['name']
                    new_cookie = {k: cookie[k] for k in ('name', 'value', 'path', 'expiry') if k in cookie}
                    self.driver.add_cookie(new_cookie)
    
            if request.meta.get('redirect_url'):
                self.driver.get(request.meta.get('redirect_url'))
                self.driver.implicitly_wait(5)
    
            request.meta['driver'] = self.driver
    
            return HtmlResponse(self.driver.current_url, body=self.driver.page_source, encoding='utf-8', request=request)
    
        def spider_opened(self, spider):
            options = Options()
            binary = settings.get('SELENIUM_FIREFOX_BINARY') or which('firefox')
            SELENIUM_HEADLESS = settings.get('SELENIUM_HEADLESS') or False
            if SELENIUM_HEADLESS:
                print " ---- HEADLESS ----"
                options.add_argument("--headless")
    
            firefox_capabilities = DesiredCapabilities.FIREFOX
            firefox_capabilities['marionette'] = True
            firefox_capabilities['binary'] = binary
            try:
                self.driver = webdriver.Firefox(capabilities=firefox_capabilities, firefox_options=options)
            except Exception:
              print " ---- Unable to instantiate selenium webdriver instance ! ----"
              traceback.print_exc()
              os._exit(1)
    
        def spider_closed(self, spider):
            if self.driver:
                self.driver.close()
    

    [htop screenshot taken just before the crash]
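
For completeness, this is roughly how the middleware above is wired into the project's settings.py (a minimal sketch: the module path myproject.middlewares and the priority 800 are assumptions, only the SELENIUM_* names come from the code above):

    # settings.py -- sketch of the settings the middleware reads
    # (module path and priority are assumed; adjust to your project layout)
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.SeleniumMiddleware': 800,
    }

    # read in spider_opened() via get_project_settings()
    SELENIUM_HEADLESS = True           # run Firefox with --headless
    SELENIUM_FIREFOX_BINARY = None     # falls back to which('firefox') when unset

Only requests sent with meta={'selenium': True} are rendered through Firefox; everything else falls through process_request() to Scrapy's default downloader.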

1 Answer:

Answer 0 (score: 0)

A quick look through your code block shows that within def spider_closed(self, spider): you are using self.driver.close(), as follows:

def spider_closed(self, spider):
    if self.driver:
        self.driver.close()

Following the best practice for closing webdriver and web browser instances, you should invoke quit() within the tearDown() {} method. Invoking quit() deletes the current browsing session by sending the quit command with {"flags":["eForceQuit"]} and finally sends a GET request on the /shutdown EndPoint, whereas self.driver.close() only closes the current browser window.
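
A minimal standalone sketch of that pattern (assuming Selenium 3 with geckodriver and Firefox available on the PATH, not your middleware itself), where quit() is guaranteed to run even if the scrape raises:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument('--headless')

    driver = webdriver.Firefox(firefox_options=options)
    try:
        driver.get('https://example.com')
        print(driver.title)
    finally:
        # quit() ends the whole session and shuts down the geckodriver/Firefox
        # processes; close() would only close the current browser window
        driver.quit()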

You can find a detailed discussion in How to stop geckodriver process impacting PC memory, without calling driver.quit()?

So the solution is to replace self.driver.close() with self.driver.quit(), as follows:

def spider_closed(self, spider):
    if self.driver:
        self.driver.quit()
