I have an Airflow environment running on a four-node cluster that has worked well for me for several months.
ec2-instances
Recently I have been working on a more complex DAG with a couple dozen tasks, whereas the ones I worked on before were relatively small. I'm not sure if that is why I'm now seeing this issue, but I sporadically get this error:
On the Airflow UI, under the logs for the task:
psycopg2.OperationalError: FATAL: sorry, too many clients already
And on the webserver (the output from running airflow webserver), I get the same error:
[2018-07-23 17:43:46 -0400] [8116] [ERROR] Exception in worker process
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2158, in _wrap_pool_connect
return fn()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 403, in connect
return _ConnectionFairy._checkout(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 788, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 532, in checkout
rec = pool._do_get()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1193, in _do_get
self._dec_overflow()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 187, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1190, in _do_get
return self._create_connection()
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 350, in _create_connection
return _ConnectionRecord(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 477, in __init__
self.__connect(first_connect_check=True)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 671, in __connect
connection = pool._invoke_creator(self)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 106, in connect
return dialect.connect(*cargs, **cparams)
File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 410, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: sorry, too many clients already
I can fix this by running sudo /etc/init.d/postgresql restart
and restarting the DAG, but then after about three runs I start seeing the error again.
I haven't found anything specific to Airflow, but from other posts I've found such as this one, they say it is because my client (Airflow, in this case) is trying to open more connections to PostgreSQL than PostgreSQL is configured to handle. I ran this command and found that my PostgreSQL can accept 100 connections:
[ec2-user@ip-1-2-3-4 ~]$ sudo su
root@ip-1-2-3-4
[/home/ec2-user]# psql -U postgres
psql (9.2.24)
Type "help" for help.
postgres=# show max_connections;
max_connections
-----------------
100
(1 row)
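While the DAG is running I can also count how many connections are currently open, to see how close I am getting to that limit. This is just a standard pg_stat_activity query from the same psql session, nothing Airflow-specific:
postgres=# SELECT count(*) FROM pg_stat_activity;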
In this solution, the post says I could increase PostgreSQL's max connections, but I'm wondering whether I should instead set a value in my airflow.cfg file so that the number of connections Airflow is allowed to open matches my PostgreSQL max connections. Does anyone know where I can set this value in Airflow? Here are the fields I think are relevant:
# The SqlAlchemy pool size is the maximum number of database connections
# in the pool.
sql_alchemy_pool_size = 5
# The SqlAlchemy pool recycle is the number of seconds a connection
# can be idle in the pool before it is invalidated. This config does
# not apply to sqlite.
sql_alchemy_pool_recycle = 3600
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 32
# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 32
Open to any suggestions for fixing this issue. Is this something related to my Airflow configuration, or is it an issue with my PostgreSQL configuration?
Also, because I am testing a new DAG, I will occasionally kill running tasks and start them over. Perhaps doing this is causing some of the processes to not die correctly, leaving dead connections open to PostgreSQL?
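In case it's relevant, this is the check I've been using to look for leftover connections. On this PostgreSQL version (9.2) pg_stat_activity has a state column, so any dead-but-open sessions should show up as idle:
postgres=# SELECT pid, usename, state, query_start FROM pg_stat_activity WHERE state = 'idle';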
Answer 0 (score: 1)
I ran into a similar issue. I changed max_connections in Postgres to 10000 and sql_alchemy_pool_size in the Airflow config to 1000. Now I can run hundreds of tasks in parallel.
PS: My machine has 32 cores and 60GB of RAM, so it can handle the load.
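In case the exact changes help anyone, these are the two settings involved. The postgresql.conf path is an assumption (it varies by install; SHOW config_file; in psql will print yours), and max_connections only takes effect after a full PostgreSQL restart:
# postgresql.conf -- find yours with: psql -U postgres -c 'SHOW config_file;'
max_connections = 10000
# airflow.cfg, in the [core] section
sql_alchemy_pool_size = 1000
Then restart PostgreSQL (e.g. sudo /etc/init.d/postgresql restart) and the Airflow webserver/scheduler so they pick up the new pool size.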
Answer 1 (score: 0)
sql_alchemy_max_overflow: the maximum overflow size of the pool. When the number of checked-out connections reaches the size set in pool_size, additional connections will be returned up to this limit. When those additional connections are returned to the pool, they are disconnected and discarded. It follows that the total number of simultaneous connections the pool will allow is pool_size + max_overflow, and the total number of "sleeping" connections the pool will allow is pool_size. max_overflow can be set to -1 to indicate no overflow limit; no limit will be placed on the total number of concurrent connections. Defaults to 10.
It seems the variables you want to set in airflow.cfg are sql_alchemy_pool_size and sql_alchemy_max_overflow. Your PostgreSQL max_connections must be equal to or greater than the sum of those two Airflow configuration values, because Airflow can have at most sql_alchemy_pool_size + sql_alchemy_max_overflow open connections to your database.
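As a rough worked example using the defaults from the question (the per-process multiplier is my assumption, based on every Airflow process that talks to the metadata database keeping its own SQLAlchemy pool):
# airflow.cfg, per Airflow process
sql_alchemy_pool_size = 5       # from the question
sql_alchemy_max_overflow = 10   # SQLAlchemy's default when unset
# worst case per process: 5 + 10 = 15 open connections,
# so a webserver, scheduler, and workers across 4 nodes can
# approach max_connections = 100 fairly quickly.
Keeping sql_alchemy_pool_size + sql_alchemy_max_overflow (times the number of Airflow processes) under PostgreSQL's max_connections should avoid the "too many clients" error.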