Dask workers die; how to get the logs of the dead workers

Time: 2020-05-18 23:45:10

Tags: logging dask dask-distributed

Related:

  1. Seeing Logs of Dask Workers
  2. Dask worker seem die but cannot find the worker log to figure out why

I set up a Dask cluster, for example with a supervisor daemon configured like this:

cat /etc/supervisor/conf.d/dask_server.py

[program:dask_scheduler]
command=python3 dask_server.py
directory=/home/cgi/m/remote/db_timescale/dask/
stdout_logfile=/var/log/dask/dask_scheduler_stdout.log
stderr_logfile=/var/log/dask/dask_scheduler_stderr.log
autostart=true
autorestart=true
startsecs=10
stopasgroup=true
stopwaitsecs=60
priority=1000
user=cgi

The code that runs the LocalCluster:

cat dask_server.py

from dask.distributed import Client, LocalCluster
HOST = '10.8.0.1'
SCHEDULER_PORT = 8711
DASHBOARD_PORT = ':8710'
DASK_WORKER_PROCESSES = 16
SILENCE_LOGS = False

def run_cluster():
    cluster = LocalCluster(dashboard_address=DASHBOARD_PORT, scheduler_port=SCHEDULER_PORT,
                           n_workers=DASK_WORKER_PROCESSES, silence_logs=SILENCE_LOGS)
    print("DASK Cluster Dashboard = http://%s%s/status" % (HOST, DASHBOARD_PORT))

    client = Client(cluster)
    print(client)
    print("Press Enter to quit ...")
    input()

if __name__ == '__main__':
    run_cluster()

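When I now put load on the cluster through the submit and gather methods, at some point some of the workers die. This shows up in the logs, but the actual reason the worker failed is not shown: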

tail -f /var/log/dask/dask_scheduler_std*

distributed.worker - ERROR - 'ActorSymbolBasedDetection-1b7dd14d-48d6-468d-8364-1c111715f8a0'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/distributed/worker.py", line 2178, in release_key
    del self.nbytes[key]
KeyError: 'ActorSymbolBasedDetection-1b7dd14d-48d6-468d-8364-1c111715f8a0'
distributed.core - ERROR - 'ActorSymbolBasedDetection-1b7dd14d-48d6-468d-8364-1c111715f8a0'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/distributed/core.py", line 472, in handle_stream
    handler(**merge(extra, msg))
  File "/usr/local/lib/python3.6/dist-packages/distributed/worker.py", line 2157, in steal_request
    self.release_key(key)
  File "/usr/local/lib/python3.6/dist-packages/distributed/worker.py", line 2178, in release_key
    del self.nbytes[key]
KeyError: 'ActorSymbolBasedDetection-1b7dd14d-48d6-468d-8364-1c111715f8a0'

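This basically only says that the worker must have died. Is there anywhere I can find the traceback of what was actually executing in the worker?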

From the docs

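In every case, the first place to look for further information is the logs of the given worker, which may well give a complete description of what happened. The worker prints these logs to its "standard error", which may appear in the text console from which you launched the worker.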

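So I would expect to find this in my supervisor's stdout/stderr files, but they only contain the scheduler logs. Is there a way to get the logs of the dead workers (to a file or stdout)?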
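One thing that might be worth trying is to attach a file handler to the `distributed` logger inside each worker process, so that whatever a worker logs before it dies also ends up in its own file. The following is only a minimal sketch using the standard `logging` module. The helper name `add_file_handler` and the directory `/var/log/dask/workers` are hypothetical and not part of Dask, and it is an assumption that the interesting messages go through loggers under the `distributed` namespace (the `distributed.worker - ERROR` lines above suggest they do).

```python
import logging
import os


def add_file_handler(logger_name, log_dir):
    """Attach a per-process FileHandler to the named logger; return the log path.

    Hypothetical helper, not part of Dask.
    """
    os.makedirs(log_dir, exist_ok=True)
    # One file per process id, so the worker processes do not share a file.
    path = os.path.join(log_dir, "worker_%d.log" % os.getpid())
    handler = logging.FileHandler(path)
    # Same "name - LEVEL - message" layout as the log excerpt above, plus a timestamp.
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    # Child loggers such as "distributed.worker" propagate to this parent logger.
    logging.getLogger(logger_name).addHandler(handler)
    return path


# Possible usage from the client side (assumed directory, untested on this setup):
#   client.run(add_file_handler, "distributed", "/var/log/dask/workers")
```

`client.run` only reaches the workers that are alive at that moment, so a worker that is restarted after a crash would start without the handler; registering the function with `client.register_worker_callbacks` may cover restarted workers as well. Note also that a process killed by a signal (for example by the kernel's out-of-memory killer) never gets the chance to write a Python traceback, so in that case the file would simply stop.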

0 answers:

No answers