Airflow scheduler keeps crashing with database connection errors (Google Composer)

Asked: 2018-06-28 09:00:13

Tags: airflow airflow-scheduler google-cloud-composer

I have been using Google Composer for a while (composer-0.5.2-airflow-1.9.0) and have run into problems with the Airflow scheduler. The scheduler container sometimes crashes, and it can get stuck in a locked state where it fails to start any new tasks (with database connection errors), so I have to recreate the whole Composer environment. This time there is a CrashLoopBackOff and the scheduler pod cannot restart. The error is very similar to ones I have hit before. Here is the traceback from Stackdriver:

Traceback (most recent call last):
  File "/usr/local/bin/airflow", line 27, in <module>
    args.func(args)
  File "/usr/local/lib/python2.7/site-packages/airflow/bin/cli.py", line 826, in scheduler
    job.run()
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 198, in run
    self._execute()
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 1549, in _execute
    self._execute_helper(processor_manager)
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 1594, in _execute_helper
    self.reset_state_for_orphaned_tasks(session=session)
  File "/usr/local/lib/python2.7/site-packages/airflow/utils/db.py", line 50, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/airflow/jobs.py", line 266, in reset_state_for_orphaned_tasks
    .filter(or_(*filter_for_tis), TI.state.in_(resettable_states))
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2783, in all
    return list(self)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2935, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2958, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 948, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
    context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1413, in _handle_dbapi_exception
    exc_info
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
    context)
  File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 508, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 250, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 50, in defaulterrorhandler
    raise errorvalue
sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') [SQL: u'SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.execution_date AS task_instance_execution_date, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS task_instance_duration, task_instance.state AS task_instance_state, task_instance.max_tries AS task_instance_max_tries, task_instance.hostname AS task_instance_hostname, task_instance.unixname AS task_instance_unixname, task_instance.job_id AS task_instance_job_id, task_instance.pool AS task_instance_pool, task_instance.queue AS task_instance_queue, task_instance.priority_weight AS task_instance_priority_weight, task_instance.operator AS task_instance_operator, task_instance.queued_dttm AS task_instance_queued_dttm, task_instance.pid AS task_instance_pid \nFROM task_instance \nWHERE (task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s OR task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.execution_date = %s) AND task_instance.state IN (%s, %s) FOR UPDATE'] [parameters: ('pb_write_event_tables_v2_dev2', 'check_table_chest_progressed', datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_name_changed', datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_registered', 
datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_unit_leveled_up', datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_virtual_currency_earned', datetime.datetime(2018, 6, 26, 8, 0), 'pb_write_event_tables_v2_dev2', 'check_table_virtual_currency_spent', datetime.datetime(2018, 6, 26, 8, 0), u'scheduled', u'queued')] (Background on this error at: http://sqlalche.me/e/e3q8)
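The failing statement is a `SELECT ... FOR UPDATE` issued by the scheduler's `reset_state_for_orphaned_tasks`; MySQL raises error 1205 when that query waits longer than `innodb_lock_wait_timeout` for row locks held by another transaction. Such errors are transient, so one common mitigation is to retry the transaction. The following is a minimal sketch of that retry pattern (the wrapper function, table contents, and retry/delay values are all illustrative; stdlib `sqlite3` stands in for the Cloud SQL MySQL backend, since both raise a DB-API `OperationalError`):

```python
import sqlite3
import time

# Hypothetical retry wrapper: a lock-wait timeout (MySQL error 1205) is
# transient, so re-running the transaction usually succeeds once the
# competing transaction commits or rolls back.
def run_with_retry(conn, sql, params=(), retries=3, delay=0.1):
    for attempt in range(retries):
        try:
            cur = conn.execute(sql, params)
            rows = cur.fetchall()
            conn.commit()
            return rows
        except sqlite3.OperationalError:
            conn.rollback()  # release any locks we hold before retrying
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# Toy stand-in for the task_instance table the scheduler queries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT, state TEXT)")
conn.execute("INSERT INTO task_instance VALUES ('check_table_registered', 'queued')")

rows = run_with_retry(
    conn,
    "SELECT task_id FROM task_instance WHERE state IN (?, ?)",
    ("scheduled", "queued"),
)
print(rows)  # [('check_table_registered',)]
```

Airflow 1.9.0's scheduler does not retry this particular query itself, which is presumably why a single lock-wait timeout can take the whole scheduler process down.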

I am not familiar with the technical details of RDBMS errors. But this is an out-of-the-box Google Composer environment with default settings, so I am wondering whether anyone else has run into a similar problem or knows what is going on. I know Composer uses Google Cloud SQL for its database, apparently with a MySQL backend.

The Airflow scheduler image is gcr.io/cloud-airflow-releaser/airflow-worker-scheduler-1.9.0:cloud_composer_service_2018-06-19-RC3

I should add that I have not run into this scheduler problem on my self-managed Airflow installation on Kubernetes, but there I was using a more recent Airflow version with a PostgreSQL backend.

1 Answer:

Answer 0 (score: 0)

This is probably caused by the environment's resources being overwhelmed.

To avoid this, you can spread out the DAG load (for example, by staggering DAG schedules) or use a larger machine type for the environment.
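One way to spread out the DAG load, sketched below, is to offset each DAG's schedule by a few minutes so the scheduler does not create all task instances (and contend for the same row locks) at the same instant. The DAG names and the 5-minute spacing here are assumptions for illustration:

```python
# Generate staggered hourly cron schedules: minute 0, 5, 10, ... per DAG.
# These would be passed as each DAG's schedule_interval.
dag_names = [
    "pb_write_event_tables_v2_dev2",  # the DAG from the traceback
    "another_hourly_dag",             # hypothetical
    "yet_another_hourly_dag",         # hypothetical
]

def staggered_cron(index, minutes_apart=5):
    # Hourly schedule offset per DAG; wraps around after 60 minutes.
    return "{} * * * *".format((index * minutes_apart) % 60)

schedules = {name: staggered_cron(i) for i, name in enumerate(dag_names)}
for name, cron in schedules.items():
    print(name, "->", cron)
```

This does not reduce the total work, but it smooths the spikes of concurrent `task_instance` writes that tend to trigger lock-wait timeouts.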

Also, since this problem has been fixed in newer releases, I recommend using the latest version, composer-1.10.6-airflow-1.10.6.