Airflow logs BrokenPipeException

Time: 2018-07-16 16:01:16

Tags: python-3.x logging airflow broken-pipe

I'm running a clustered Airflow environment with four AWS EC2 instances acting as the servers (a rough config sketch of how they are wired together follows the list).

ec2-instances

  • Server 1: Webserver, Scheduler, Redis queue, PostgreSQL database
  • Server 2: Webserver
  • Server 3: Worker
  • Server 4: Worker
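
For reference, a minimal sketch (not my exact file) of how those pieces are wired together in airflow.cfg on each server; the hostnames and credentials are placeholders, and the key names are the Airflow 1.9 ones as far as I remember:

[core]
executor = CeleryExecutor
# metadata database on Server 1
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@server1:5432/airflow

[celery]
# Redis queue on Server 1 acts as the Celery broker
broker_url = redis://server1:6379/0
celery_result_backend = db+postgresql://airflow:airflow@server1:5432/airflow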

My setup has been working fine for about three months now, but sporadically, roughly once a week, I get a Broken Pipe Exception when Airflow attempts to log something.

*** Log file isn't local.
*** Fetching here: http://ip-1-2-3-4:8793/log/foobar/task_1/2018-07-13T00:00:00/1.log

[2018-07-16 00:00:15,521] {cli.py:374} INFO - Running on host ip-1-2-3-4
[2018-07-16 00:00:15,698] {models.py:1197} INFO - Dependencies all met for <TaskInstance: foobar.task_1 2018-07-13 00:00:00 [queued]>
[2018-07-16 00:00:15,710] {models.py:1197} INFO - Dependencies all met for <TaskInstance: foobar.task_1 2018-07-13 00:00:00 [queued]>
[2018-07-16 00:00:15,710] {models.py:1407} INFO - 
--------------------------------------------------------------------------------
Starting attempt 1 of 1
--------------------------------------------------------------------------------

[2018-07-16 00:00:15,719] {models.py:1428} INFO - Executing <Task(OmegaFileSensor): task_1> on 2018-07-13 00:00:00
[2018-07-16 00:00:15,720] {base_task_runner.py:115} INFO - Running: ['bash', '-c', 'airflow run foobar task_1 2018-07-13T00:00:00 --job_id 1320 --raw -sd DAGS_FOLDER/datalake_digitalplatform_arl_workflow_schedule_test_2.py']
[2018-07-16 00:00:16,532] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,532] {configuration.py:206} WARNING - section/key [celery/celery_ssl_active] not found in config
[2018-07-16 00:00:16,532] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,532] {default_celery.py:41} WARNING - Celery Executor will run without SSL
[2018-07-16 00:00:16,534] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,533] {__init__.py:45} INFO - Using executor CeleryExecutor
[2018-07-16 00:00:16,597] {base_task_runner.py:98} INFO - Subtask: [2018-07-16 00:00:16,597] {models.py:189} INFO - Filling up the DagBag from /home/ec2-user/airflow/dags/datalake_digitalplatform_arl_workflow_schedule_test_2.py
[2018-07-16 00:00:16,768] {cli.py:374} INFO - Running on host ip-1-2-3-4
[2018-07-16 00:16:24,931] {logging_mixin.py:84} WARNING - --- Logging error ---

[2018-07-16 00:16:24,931] {logging_mixin.py:84} WARNING - Traceback (most recent call last):

[2018-07-16 00:16:24,931] {logging_mixin.py:84} WARNING -   File "/usr/lib64/python3.6/logging/__init__.py", line 996, in emit
    self.flush()

[2018-07-16 00:16:24,932] {logging_mixin.py:84} WARNING -   File "/usr/lib64/python3.6/logging/__init__.py", line 976, in flush
    self.stream.flush()

[2018-07-16 00:16:24,932] {logging_mixin.py:84} WARNING - BrokenPipeError: [Errno 32] Broken pipe

[2018-07-16 00:16:24,932] {logging_mixin.py:84} WARNING - Call stack:

[2018-07-16 00:16:24,933] {logging_mixin.py:84} WARNING -   File "/usr/bin/airflow", line 27, in <module>
    args.func(args)

[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING -   File "/usr/local/lib/python3.6/site-packages/airflow/bin/cli.py", line 392, in run
    pool=args.pool,

[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING -   File "/usr/local/lib/python3.6/site-packages/airflow/utils/db.py", line 50, in wrapper
    result = func(*args, **kwargs)

[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING -   File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1488, in _run_raw_task
    result = task_copy.execute(context=context)

[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING -   File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 78, in execute
    while not self.poke(context):

[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING -   File "/home/ec2-user/airflow/plugins/custom_plugins.py", line 35, in poke
    directory = os.listdir(full_path)

[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING -   File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 36, in handle_timeout
    self.log.error("Process timed out")

[2018-07-16 00:16:24,934] {logging_mixin.py:84} WARNING - Message: 'Process timed out'
Arguments: ()

[2018-07-16 00:16:24,942] {models.py:1595} ERROR - Timeout
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1488, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.6/site-packages/airflow/operators/sensors.py", line 78, in execute
    while not self.poke(context):
  File "/home/ec2-user/airflow/plugins/custom_plugins.py", line 35, in poke
    directory = os.listdir(full_path)
  File "/usr/local/lib/python3.6/site-packages/airflow/utils/timeout.py", line 37, in handle_timeout
    raise AirflowTaskTimeout(self.error_message)
airflow.exceptions.AirflowTaskTimeout: Timeout
[2018-07-16 00:16:24,942] {models.py:1624} INFO - Marking task as FAILED.
[2018-07-16 00:16:24,956] {models.py:1644} ERROR - Timeout

Sometimes the error will also say:

*** Log file isn't local.
*** Fetching here: http://ip-1-2-3-4:8793/log/foobar/task_1/2018-07-12T00:00:00/1.log
*** Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url: http://ip-1-2-3-4:8793/log/foobar/task_1/2018-07-12T00:00:00/1.log

I'm not sure why the logs work normally about 95% of the time and fail randomly the rest of the time. Here are the logging settings in my airflow.cfg file:

# The folder where airflow should store its log files
# This path must be absolute
base_log_folder = /home/ec2-user/airflow/logs

# Airflow can store logs remotely in AWS S3 or Google Cloud Storage. Users
# must supply an Airflow connection id that provides access to the storage
# location.
remote_log_conn_id =
encrypt_s3_logs = False

# Logging level
logging_level = INFO

# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class =

# Log format
log_format = [%%(asctime)s] {%%(filename)s:%%(lineno)d} %%(levelname)s - %%(message)s
simple_log_format = %%(asctime)s %%(levelname)s - %%(message)s

# Name of handler to read task instance logs.
# Default to use file task handler.
task_log_reader = file.task

# Log files for the gunicorn webserver. '-' means log to stderr.
access_logfile = -
error_logfile = 

# The amount of time (in secs) webserver will wait for initial handshake
# while fetching logs from other worker machine
log_fetch_timeout_sec = 5

# When you start an airflow worker, airflow starts a tiny web server
# subprocess to serve the workers local log files to the airflow main
# web server, who then builds pages and sends them to users. This defines
# the port on which the logs are served. It needs to be unused, and open
# visible from the main web server to connect into the workers.
worker_log_server_port = 8793

# How often should stats be printed to the logs
print_stats_interval = 30

child_process_log_directory = /home/ec2-user/airflow/logs/scheduler

I'm wondering whether I should try a different logging technique, such as writing the logs to an S3 bucket, or whether there is something else I can do to fix this issue.
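
If I do try S3, my understanding is that it mostly comes down to pointing the remote log settings at a bucket and an Airflow connection, roughly like the sketch below; the bucket name and connection id are placeholders, and some Airflow 1.9 setups apparently also need a custom logging_config_class:

# airflow.cfg (sketch)
remote_base_log_folder = s3://my-airflow-log-bucket/logs
remote_log_conn_id = my_s3_conn
encrypt_s3_logs = False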

Update

Writing the logs to S3 did not resolve this issue. Also, the error is more consistent now (though still sporadic); it's happening about 50% of the time. One thing I noticed is that the task it happens on is my AWS EMR creation task. Starting an AWS EMR cluster takes about 20 minutes, and then the task has to wait for the Spark commands to run on the EMR cluster, so that single task runs for about 30 minutes. I'm wondering if that is too long for an Airflow task to run and if that's why it starts failing to write to the logs. If so, I could break up the EMR work so that there is one task for the EMR creation and then another task for the Spark commands on the EMR cluster.
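
If I do break it up, the rough shape I have in mind is something like the sketch below, using the contrib EMR operators (assuming they're available in my Airflow version); JOB_FLOW_OVERRIDES and SPARK_STEPS are placeholders for my real cluster and step definitions:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

JOB_FLOW_OVERRIDES = {}  # placeholder: real EMR cluster config goes here
SPARK_STEPS = []         # placeholder: real spark-submit steps go here

dag = DAG('emr_split_sketch', start_date=datetime(2018, 7, 1), schedule_interval=None)

# Task 1: only create the cluster; the job flow id is pushed to XCom
create_cluster = EmrCreateJobFlowOperator(
    task_id='create_emr_cluster',
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    dag=dag)

# Task 2: submit the Spark step to the cluster created above
add_spark_step = EmrAddStepsOperator(
    task_id='add_spark_step',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
    steps=SPARK_STEPS,
    aws_conn_id='aws_default',
    dag=dag)

# Task 3: wait for the Spark step to finish in its own task
watch_spark_step = EmrStepSensor(
    task_id='watch_spark_step',
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster', key='return_value') }}",
    step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step', key='return_value')[0] }}",
    aws_conn_id='aws_default',
    dag=dag)

create_cluster >> add_spark_step >> watch_spark_step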

Note:

I have also created a new ticket on Airflow's Jira: https://issues.apache.org/jira/browse/AIRFLOW-2844

1 Answer:

Answer 0 (score: 2):

This issue is a symptom of another issue I just resolved here: AirflowException: Celery command failed - The recorded hostname does not match this instance's hostname.

I didn't see the AirflowException: Celery command failed for a while because it showed up in the airflow worker logs. It wasn't until I watched the airflow worker logs in real time that I saw that, when that error was thrown, I also got the BrokenPipeException in my task.

It gets somewhat weirder though. I would only see the BrokenPipeException thrown if I did print("something to log") and the AirflowException: Celery command failed... error happened on the worker node. When I changed all of my print statements to use import logging ... logging.info("something to log"), I would not see the BrokenPipeException, but the task would still fail because of the AirflowException: Celery command failed... error. But had I not seen the BrokenPipeException being thrown in my Airflow task logs, I wouldn't have known why the task was failing, because once I eliminated the print statements I never saw any error in the Airflow task logs (only in the $ airflow worker logs).

So, long story short, there are a few key takeaways.

  1. Don't do print("something to log"); use Airflow's built-in logging by importing the logging module and then using it, i.e. import logging followed by logging.info("something to log") (see the sketch after this list).
  2. If you're using an AWS EC2 instance as your server for Airflow, you may be experiencing this issue: https://github.com/apache/incubator-airflow/pull/2484. A fix for this issue has already been integrated into Airflow version 1.10 (I'm currently using Airflow version 1.9), so upgrade your Airflow version to 1.10. You can also use the command here. Alternatively, if you don't want to upgrade your Airflow version, you can follow the steps on the github issue to either manually update the file with the fix or fork Airflow and cherry-pick the commit that fixes it.
  3. 如果您将AWS EC2-Instance用作Airflow的服务器,则可能遇到此问题:https://github.com/apache/incubator-airflow/pull/2484该问题的修复已集成到Airflow版本1.10中(I' m当前使用Airflow版本1.9)。因此,升级您的Airflow version to 1.10。您也可以使用the command here logging.info("something to log")。另外,如果您不想升级Airflow版本,则可以按照the github issue上的步骤操作,以使用修复程序手动更新文件,或者派生Airflow并选择修复该文件的提交。