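Connection pool is full, discarding connection with ThreadPoolExecutor and multiple headless browsers through Selenium and Python

Time: 2018-12-05 21:31:31

Tags: python selenium threadpool threadpoolexecutor urllib3

I'm writing some automation software using selenium==3.141.0, python 3.6.7 and chromedriver 2.44.

Most of the logic can be handled by a single browser instance, but for some parts I have to launch 10-20 instances to get a decent execution speed.

Once execution reaches the part run by ThreadPoolExecutor, browser interactions start raising this error: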

WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url

Browser setup:

def init_chromedriver(cls):
    try:
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument(f"user-agent={Utils.get_random_browser_agent()}")
        prefs = {"profile.managed_default_content_settings.images": 2}
        chrome_options.add_experimental_option("prefs", prefs)

        driver = webdriver.Chrome(driver_paths['chrome'],
                                       chrome_options=chrome_options,
                                       service_args=['--verbose', f'--log-path={bundle_dir}/selenium/chromedriver.log'])
        driver.implicitly_wait(10)

        return driver
    except Exception as e:
        logger.error(e)

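Relevant code:

ProfileParser instantiates the web driver and performs some page interactions. I believe the interactions themselves are not relevant, because everything works fine without ThreadPoolExecutor. In short, however: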

class ProfileParser(object):
    def __init__(self, acc):
        self.driver = Utils.init_chromedriver()

    def __enter__(self):
        # Required by the `with ProfileParser(acc) as pparser` usage;
        # not shown in the original excerpt.
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        Utils.shutdown_chromedriver(self.driver)
        self.driver = None

    def collect_user_info(self, post_url):
        self.driver.get(post_url)
        profile_url = self.driver.find_element_by_xpath('xpath_here').get_attribute('href')

When running in ThreadPoolExecutor, the error above appears at self.driver.find_element_by_xpath or at self.driver.get.

This works:

with ProfileParser(acc) as pparser:
    pparser.collect_user_info(posts[0])

These options do not work (connectionpool errors):

futures = []
#one worker, one future
with ThreadPoolExecutor(max_workers=1) as executor:
        with ProfileParser(acc) as pparser:
            futures.append(executor.submit(pparser.collect_user_info, posts[0]))

#10 workers, multiple futures
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        with ProfileParser(acc) as pparser:
            futures.append(executor.submit(pparser.collect_user_info, p))

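Update:

I found a temporary solution (it does not invalidate the initial question): instantiate the webdriver outside of the ProfileParser class. I don't know why this works while the initial approach doesn't. Perhaps the cause lies in some language specifics? Thanks for the answers, but the ThreadPoolExecutor max_workers limit does not seem to be the problem: as you can see, in one of the options I tried to submit a single instance and it still did not work.

Current workaround: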

futures = []
with ThreadPoolExecutor(max_workers=10) as executor:
    for p in posts:
        driver = Utils.init_chromedriver()
        futures.append({
            'future': executor.submit(collect_user_info, driver, acc, p),
            'driver': driver
        })

for f in futures:
    f['future'].done()
    Utils.shutdown_chromedriver(f['driver'])
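
One plausible explanation for the difference, offered as an assumption rather than something confirmed in this thread: executor.submit() returns immediately, so in the failing variants the `with ProfileParser(...)` block exits and shuts the driver down before the worker thread runs collect_user_info. Later requests would then reach a chromedriver that is already gone, which is consistent with the "Connection refused" warnings. The sketch below reproduces only that ordering. FakeDriver and Parser are hypothetical stand-ins (no Selenium involved), and the threading.Event exists purely to make the ordering deterministic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a WebDriver; not part of Selenium."""

    def __init__(self):
        self.alive = True

    def quit(self):
        self.alive = False


class Parser:
    """Hypothetical context manager that owns a FakeDriver."""

    def __init__(self):
        self.driver = FakeDriver()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.driver.quit()

    def collect(self, gate):
        # Wait until the caller signals; models a task that starts after submit() returns.
        gate.wait()
        return self.driver.alive


def submit_inside_with():
    """Submit a bound method, then leave the Parser's `with` block before the task runs."""
    gate = threading.Event()
    with ThreadPoolExecutor(max_workers=1) as executor:
        with Parser() as parser:
            future = executor.submit(parser.collect, gate)
        # Parser.__exit__ has already run here, so the driver is gone.
        gate.set()
        return future.result()


def whole_lifecycle_in_worker():
    """Create, use and close the Parser entirely inside the submitted task."""

    def job():
        with Parser() as parser:
            return parser.driver.alive

    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(job).result()


if __name__ == "__main__":
    print("driver alive when task ran (submit inside with):", submit_inside_with())
    print("driver alive when task ran (lifecycle in worker):", whole_lifecycle_in_worker())
```

In the workaround above, Future.done() only reports status and does not wait. The drivers survive because leaving the `with ThreadPoolExecutor` block waits for all submitted tasks before the shutdown loop runs.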

2 Answers:

Answer 0: (score: 3)

This error message...

WARNING|05/Dec/2018 17:33:11|connectionpool|_put_conn|274|Connection pool is full, discarding connection: 127.0.0.1
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))': /session/119df5b95710793a0421c13ec3a83847/url
WARNING|05/Dec/2018 17:33:11|connectionpool|urlopen|662|Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcee7ada048>: Failed to establish a new connection: [Errno 111] Connection refused',)': /session/119df5b95710793a0421c13ec3a83847/url

...seems to be an issue in urllib3's connection pooling, which raised these WARNINGs while executing the def _put_conn(self, conn) method in connectionpool.py.

def _put_conn(self, conn):
    """
    Put a connection back into the pool.

    :param conn:
        Connection object for the current host and port as returned by
        :meth:`._new_conn` or :meth:`._get_conn`.

    If the pool is already full, the connection is closed and discarded
    because we exceeded maxsize. If connections are discarded frequently,
    then maxsize should be increased.

    If the pool is closed, then the connection will be closed and discarded.
    """
    try:
        self.pool.put(conn, block=False)
        return  # Everything is dandy, done.
    except AttributeError:
        # self.pool is None.
        pass
    except queue.Full:
        # This should never happen if self.block == True
        log.warning(
            "Connection pool is full, discarding connection: %s",
            self.host)

    # Connection never got put back into the pool, close it.
    if conn:
        conn.close()
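
To make the discard path concrete, here is a simplified model of a bounded pool built on the standard-library queue module. It is not urllib3's code: TinyPool and return_connections are hypothetical names, and the model starts with an empty queue, whereas the real pool is more involved. It only illustrates why returning more connections than maxsize leads to discards, and why a larger maxsize avoids them.

```python
import queue


class TinyPool:
    """Simplified, hypothetical model of a bounded connection pool (not urllib3 code)."""

    def __init__(self, maxsize):
        self.pool = queue.LifoQueue(maxsize)
        self.discarded = 0

    def put_conn(self, conn):
        try:
            # Non-blocking put: succeeds only while a slot is free.
            self.pool.put(conn, block=False)
        except queue.Full:
            # Corresponds to the "Connection pool is full, discarding connection" branch.
            self.discarded += 1


def return_connections(maxsize, in_flight):
    """Return `in_flight` connections to a pool with `maxsize` slots and count discards."""
    pool = TinyPool(maxsize)
    for i in range(in_flight):
        pool.put_conn(f"conn-{i}")
    return pool.discarded


if __name__ == "__main__":
    print(return_connections(maxsize=10, in_flight=20))
    print(return_connections(maxsize=20, in_flight=20))
```

With 10 slots and 20 returned connections, the surplus is discarded. With 20 slots, every connection fits back into the pool.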

ThreadPoolExecutor

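ThreadPoolExecutor is an Executor subclass that uses a pool of threads to execute calls asynchronously. Deadlocks can occur when the callable associated with a Future waits on the result of another Future.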

class concurrent.futures.ThreadPoolExecutor(max_workers=None, thread_name_prefix='', initializer=None, initargs=())
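  • An Executor subclass that uses a pool of at most max_workers threads to execute calls asynchronously.
  • initializer is an optional callable that is called at the start of each worker thread; initargs is a tuple of arguments passed to the initializer. Should initializer raise an exception, all currently pending jobs will raise a BrokenThreadPool, as will any attempt to submit more jobs to the pool.
  • Changed in version 3.5: if max_workers is None or not given, it defaults to the number of processors on the machine multiplied by 5, assuming that ThreadPoolExecutor is often used to overlap I/O instead of CPU work, and that the number of workers should be higher than the number of workers for ProcessPoolExecutor.
  • New in version 3.6: the thread_name_prefix argument was added to allow users to control the names of the worker threads created by the pool, for easier debugging.
  • Changed in version 3.7: added the initializer and initargs arguments.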
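
The parameters listed above can be sketched as follows. This is a minimal illustration that assumes Python 3.7 or later (for initializer and initargs); init_worker, task, run and the "driver" label are hypothetical names, and a plain string stands in for whatever per-thread resource a real program would create.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Thread-local storage: each worker thread sees its own `resource` attribute.
_local = threading.local()


def init_worker(label):
    # Runs once at the start of every worker thread.
    _local.resource = f"{label}-{threading.current_thread().name}"


def task(n):
    # Uses the resource that belongs to the thread executing this call.
    return n * 2, _local.resource


def run(values, workers=3):
    with ThreadPoolExecutor(
        max_workers=workers,
        thread_name_prefix="parser",
        initializer=init_worker,
        initargs=("driver",),
    ) as executor:
        return list(executor.map(task, values))


if __name__ == "__main__":
    for doubled, resource in run(range(5)):
        print(doubled, resource)
```

Keeping one resource per worker thread means a task never touches a resource that was created or closed by another thread.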

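As per your question, since you are trying to launch 10-20 instances, the default connection pool size of 10 does not seem to be enough in your case; this value is hardcoded in adapters.py.

Moreover, @EdLeafe mentions in the discussion Getting error: Connection pool is full, discarding connection:

"It looks like within the requests code, None objects are normal. If _get_conn() gets None from the pool, it simply creates a new connection. It seems odd, though, that it should start with all those None objects, and that _put_conn() isn't smart enough to replace the None with the connection."

However, the merge Add pool size parameter to client constructor has fixed this issue.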

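Solution

Increase the default connection pool size of 10 (previously hardcoded in adapters.py, now configurable).

Update

As per your comment update "...submit a single instance and the outcome is the same...". As per @meferguson84 in the discussion Getting error: Connection pool is full, discarding connection:

"I stepped into the code to the point where it mounts the adapter, just to play with the pool size and see if it made a difference. What I found was that the queue is full of NoneType objects, with the actual upload connection being the last item in the list. The list is 10 items long (which makes sense). What doesn't make sense is that the unfinished_tasks parameter for the pool is 11. How can this be when the queue itself is only 11 items? Also, is it normal for the queue to be full of NoneType objects, with the connection we are using being the last item in the list?"

That sounds like a possible cause in your use case as well. It may sound redundant, but you can still perform a few ad-hoc steps: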

Answer 1: (score: 0)

Please look at your error:

ProtocolError('Connection aborted.', 
  RemoteDisconnected('Remote end closed connection without response',))

'NewConnectionError('<urllib3.connection.HTTPConnection object at >: 
   Failed to establish a new connection: [Errno 111] Connection refused',)':

The error occurs because you are opening multiple connections too quickly; the server may be down, or it may be blocking your requests.