Related:
Setting up a dask cluster (for example, with a supervisor daemon configured like this):
cat /etc/supervisor/conf.d/dask_server.py
[program:dask_scheduler]
command=python3 dask_server.py
directory=/home/cgi/m/remote/db_timescale/dask/
stdout_logfile=/var/log/dask/dask_scheduler_stdout.log
stderr_logfile=/var/log/dask/dask_scheduler_stderr.log
autostart=true
autorestart=true
startsecs=10
stopasgroup=true
stopwaitsecs=60
priority=1000
user=cgi
The code that runs the LocalCluster:
cat dask_server.py
from dask.distributed import Client, LocalCluster

HOST = '10.8.0.1'
SCHEDULER_PORT = 8711
DASHBOARD_PORT = ':8710'
DASK_WORKER_PROCESSES = 16
SILENCE_LOGS = False


def run_cluster():
    cluster = LocalCluster(dashboard_address=DASHBOARD_PORT, scheduler_port=SCHEDULER_PORT,
                           n_workers=DASK_WORKER_PROCESSES, silence_logs=SILENCE_LOGS)
    print("DASK Cluster Dashboard = http://%s%s/status" % (HOST, DASHBOARD_PORT))
    client = Client(cluster)
    print(client)
    print("Press Enter to quit ...")
    input()


if __name__ == '__main__':
    run_cluster()
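When I now put load on the cluster through the submit and gather methods, I reach a point where some workers die. This shows up in the log, but without the actual reason the worker failed: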
tail -f /var/log/dask/dask_scheduler_std*
distributed.worker - ERROR - 'ActorSymbolBasedDetection-1b7dd14d-48d6-468d-8364-1c111715f8a0'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/distributed/worker.py", line 2178, in release_key
    del self.nbytes[key]
KeyError: 'ActorSymbolBasedDetection-1b7dd14d-48d6-468d-8364-1c111715f8a0'
distributed.core - ERROR - 'ActorSymbolBasedDetection-1b7dd14d-48d6-468d-8364-1c111715f8a0'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/distributed/core.py", line 472, in handle_stream
    handler(**merge(extra, msg))
  File "/usr/local/lib/python3.6/dist-packages/distributed/worker.py", line 2157, in steal_request
    self.release_key(key)
  File "/usr/local/lib/python3.6/dist-packages/distributed/worker.py", line 2178, in release_key
    del self.nbytes[key]
KeyError: 'ActorSymbolBasedDetection-1b7dd14d-48d6-468d-8364-1c111715f8a0'
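This basically only says that the worker must have died. Is there somewhere I can find the traceback of what was actually executing in the worker?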
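As a possible stopgap for the traceback question, here is a minimal sketch that wraps the task function so a Python exception comes back as text instead of being lost. The names capture_traceback and flaky are hypothetical and not part of dask, and this only helps when the task raises a normal Python exception; it would not catch a worker that is killed outright (for example by the operating system).

```python
import functools
import traceback


def capture_traceback(func):
    """Hypothetical helper: return a Python exception as text instead of raising it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return {"ok": True, "result": func(*args, **kwargs)}
        except Exception:
            # format_exc() renders the traceback of the exception being handled
            return {"ok": False, "traceback": traceback.format_exc()}
    return wrapper


def flaky(x):
    # Stand-in for the real task body (assumption for illustration)
    return 10 / x


if __name__ == "__main__":
    # Called locally here; on the cluster the wrapped function would be
    # passed to client.submit in place of the original one.
    outcome = capture_traceback(flaky)(0)
    print(outcome["traceback"])
```

The wrapped function returns a plain dict, so the result of gather can be inspected for "ok" before the actual result is used.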
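In every case, the first place to look for further information is the logs of the given worker, which may well give a complete description of what happened. The workers print these logs to their "standard error", which may appear in the text console from which you launched the worker.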
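So I expected to find them in my supervisor's stdout / stderr, but there are only the scheduler logs. Is there a way to get the logs of the dead workers (to a file / stdout)?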