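I'm running Airflow in a clustered environment on two AWS EC2 instances: one for the master and one for the worker. However, the worker node periodically throws this error when running "$ airflow worker":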
[2018-08-09 16:15:43,553] {jobs.py:2574} WARNING - The recorded hostname ip-1.2.3.4 does not match this instance's hostname ip-1.2.3.4.eco.tanonprod.comanyname.io
Traceback (most recent call last):
  File "/usr/bin/airflow", line 27, in <module>
    args.func(args)
  File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 387, in run
    run_job.run()
  File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 198, in run
    self._execute()
  File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2527, in _execute
    self.heartbeat()
  File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 182, in heartbeat
    self.heartbeat_callback(session=session)
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 50, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/airflow/jobs.py", line 2575, in heartbeat_callback
    raise AirflowException("Hostname of job runner does not match")
airflow.exceptions.AirflowException: Hostname of job runner does not match
[2018-08-09 16:15:43,671] {celery_executor.py:54} ERROR - Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.
[2018-08-09 16:15:43,681: ERROR/ForkPoolWorker-30] Task airflow.executors.celery_executor.execute_command[875a4da9-582e-4c10-92aa-5407f3b46d5f] raised unexpected: AirflowException('Celery command failed',)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 52, in execute_command
    subprocess.check_call(command, shell=True)
  File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'airflow run arl_source_emr_test_dag runEmrStep2WaiterTask 2018-08-07T00:00:00 --local -sd /var/lib/airflow/dags/arl_source_emr_test_dag.py' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python3.6/dist-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/airflow/executors/celery_executor.py", line 55, in execute_command
    raise AirflowException('Celery command failed')
airflow.exceptions.AirflowException: Celery command failed
When this error occurs, the task is marked as failed in Airflow, so my DAG fails even though nothing actually went wrong in the task itself.
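I'm using Redis as the queue and PostgreSQL as the metadata database. Both are external AWS services. I'm running all of this in a corporate environment, which is why the server's full name is ip-1.2.3.4.eco.tanonprod.comanyname.io. It seems that something expects this full name, but I have no idea where to fix the value so that it gets ip-1.2.3.4.eco.tanonprod.comanyname.io instead of ip-1.2.3.4.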
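The really strange thing about this problem is that it doesn't always happen. It only occurs once in a while when I run a DAG, and it shows up sporadically across all of my DAGs, so it isn't tied to one DAG in particular. I find the sporadic behavior odd, because it means the other task runs handle the hostname just fine.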
Note: for privacy reasons, I have changed the real IP address to 1.2.3.4.
Answer:
https://github.com/apache/incubator-airflow/pull/2484
This is exactly the problem I'm having, and other Airflow users on AWS EC2 instances have run into it as well.
Answer 0: (score: 1)
The hostname is set when the task instance runs, via self.hostname = socket.getfqdn(), where socket is the Python standard-library module (import socket).
The comparison that triggers this error is:
fqdn = socket.getfqdn()
if fqdn != ti.hostname:
    logging.warning("The recorded hostname {ti.hostname} "
                    "does not match this instance's hostname "
                    "{fqdn}".format(**locals()))
    raise AirflowException("Hostname of job runner does not match")
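It looks like the hostname on the EC2 instance is changing while the worker is running. Maybe try setting the hostname manually as described here https://forums.aws.amazon.com/thread.jspa?threadID=246906 and see whether the error still occurs.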
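To see how the check above can fail, here is a minimal sketch that compares a recorded hostname against the value socket.getfqdn() returns at check time. The function names hostname_matches and describe_hostnames and the current_hostname parameter are hypothetical, introduced only for illustration; this is not Airflow's code.

```python
import socket


def hostname_matches(recorded_hostname, current_hostname=None):
    """Return True when the recorded hostname equals the current one.

    current_hostname is a hypothetical override so the comparison can be
    exercised without depending on the machine; by default the value comes
    from socket.getfqdn(), as in the snippet quoted above.
    """
    if current_hostname is None:
        current_hostname = socket.getfqdn()
    return recorded_hostname == current_hostname


def describe_hostnames():
    """Collect both names so they can be logged side by side on the worker."""
    return {
        "gethostname": socket.gethostname(),
        "getfqdn": socket.getfqdn(),
    }
```

Calling describe_hostnames() on the worker at different times may show whether getfqdn() sometimes returns the short name (ip-1.2.3.4) and sometimes the full one, which would explain the sporadic failures.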
Answer 1: (score: 1)
I had a similar problem on my Mac. Setting hostname_callable = socket:gethostname in airflow.cfg fixed it.
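The value socket:gethostname has the form module:attribute. As a rough sketch of how such a string can be turned into a callable (resolve_callable is a hypothetical helper, and this is an assumption about the general mechanism rather than Airflow's actual implementation):

```python
import importlib


def resolve_callable(path):
    """Resolve a 'module:attribute' string to the object it names.

    Hypothetical helper for illustration only.
    """
    module_name, _, attr_name = path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# With this setting, the hostname comes from socket.gethostname()
# rather than socket.getfqdn().
get_hostname = resolve_callable("socket:gethostname")
```

With this setting the name is taken from gethostname() instead of getfqdn(), so it no longer depends on how the fully qualified name resolves at that moment.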
Answer 2: (score: 0)
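Personally, when running on a Mac, I found that I got similar errors when the Mac went to sleep during a long run. The solution is to go to System Preferences -> Energy Saver and check "Prevent computer from sleeping automatically when the display is off."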