I am running Dask distributed across a large university SLURM cluster. I have an embarrassingly parallel problem in which I read images and perform some numpy processing.

I start the cluster:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
from dask import delayed

num_workers = 10

# job args
extra_args = [
    "--error=/home/b.weinstein/logs/dask-worker-%j.err",
    "--account=ewhite",
    "--output=/home/b.weinstein/logs/dask-worker-%j.out"
]

cluster = SLURMCluster(
    processes=1,
    queue='hpg2-compute',
    cores=1,
    memory='20GB',
    walltime='48:00:00',
    job_extra=extra_args,
    local_directory="/home/b.weinstein/logs/",
    death_timeout=300)
and execute my function (Generate.run):
cluster.scale(num_workers)
dask_client = Client(cluster)

# Start dask dashboard? Not clear yet.
dask_client.run_on_scheduler(start_tunnel)

### Set scheduler
values = [delayed(Generate.run)(x) for x in data_paths]
Everything runs well for an hour, completing many tasks. Occasionally an image is corrupt and PIL raises an exception:
(DeepLidar) [b.weinstein@login3 ~]$ python
Python 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 02:32:25)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image
>>> Image.open("/ufrc/ewhite/b.weinstein/NeonData/SJER/DP3.30010.001/2018/FullSite/D17/2018_SJER_3/L3/Camera/Mosaic/V01/2018_SJER_3_252000_4114000_image.tif")
/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:771: UserWarning: Corrupt EXIF data. Expecting to read 2 bytes but only got 0.
warnings.warn(str(msg))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/PIL/Image.py", line 2657, in open
% (filename if filename else fp))
OSError: cannot identify image file '/ufrc/ewhite/b.weinstein/NeonData/SJER/DP3.30010.001/2018/FullSite/D17/2018_SJER_3/L3/Camera/Mosaic/V01/2018_SJER_3_252000_4114000_image.tif'
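As an aside, these corrupt tiles can be detected up front with a small guard around `Image.open`. This is only a sketch; `safe_open` is a hypothetical helper, not part of my code. The idea is to force a full decode inside the guard so a truncated file fails there rather than letting PIL's `OSError` escape from the task:

```python
from PIL import Image

def safe_open(path):
    """Return a PIL image, or None for corrupt/unreadable files."""
    try:
        im = Image.open(path)
        im.load()  # force a full decode so truncated data fails here, not later
        return im
    except (OSError, SyntaxError):
        return None
```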
From reading http://distributed.dask.org/en/latest/resilience.html#user-code-failures I expected the task to fail, but the cluster to persist. At worst, I expected that one worker would fail. Instead, all the workers and the scheduler die. Why is the exception propagated to the scheduler? There is not much information in the logs:
KeyError: 'start'
Traceback (most recent call last):
File "/home/b.weinstein/DeepLidar/dask_generate.py", line 105, in <module>
run_HPC(data_paths)
File "/home/b.weinstein/DeepLidar/dask_generate.py", line 92, in run_HPC
results = compute(*values,scheduler=dask_client)
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/dask/base.py", line 397, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/distributed/client.py", line 2318, in get
direct=direct)
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/distributed/client.py", line 1652, in gather
asynchronous=asynchronous)
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/distributed/client.py", line 670, in sync
return sync(self.loop, func, *args, **kwargs)
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/apps/geos/3.6.2/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/distributed/client.py", line 1497, in _gather
traceback)
File "/apps/geos/3.6.2/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/home/b.weinstein/DeepLidar/DeepForest/Generate.py", line 28, in run
windows = preprocess.create_windows(data,DeepForest_config)
File "/home/b.weinstein/DeepLidar/DeepForest/preprocess.py", line 306, in create_windows
windows=compute_windows(image=image_path, pixels=DeepForest_config["patch_size"], overlap=DeepForest_config["patch_overlap"])
File "/home/b.weinstein/DeepLidar/DeepForest/preprocess.py", line 151, in compute_windows
im = Image.open(image)
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/PIL/Image.py", line 2657, in open
% (filename if filename else fp))
OSError: cannot identify image file '/ufrc/ewhite/b.weinstein/NeonData/SJER/DP3.30010.001/2018/FullSite/D17/2018_SJER_3/L3/Camera/Mosaic/V01/2018_SJER_3_252000_4114000_image.tif'
tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
File "/home/b.weinstein/miniconda/envs/DeepLidar/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
future.result()
concurrent.futures._base.CancelledError
I have tried increasing memory to prevent dask from spilling to disk. What else could cause this behavior?
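One workaround I am considering is to catch the exception inside each task, so a corrupt image returns a sentinel instead of erring the whole graph. A sketch, where `run_one` is a stand-in for my `Generate.run` and the threaded scheduler is used only to keep the example self-contained:

```python
import dask
from dask import delayed

def run_one(path):
    # Stand-in for Generate.run: raises on a "corrupt" input, as PIL does.
    if "corrupt" in path:
        raise OSError("cannot identify image file %r" % path)
    return len(path)

def run_safe(path):
    # Catch per-task errors so one bad image yields a sentinel
    # instead of an exception propagating out of the task.
    try:
        return ("ok", run_one(path))
    except Exception as e:
        return ("error", str(e))

paths = ["a.tif", "corrupt.tif", "b.tif"]
values = [delayed(run_safe)(p) for p in paths]
results = dask.compute(*values, scheduler="threads")
good = [value for status, value in results if status == "ok"]
```

The failed path still produces a result tuple, so the bad images can be logged and reprocessed afterwards.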
EDIT: Wrapping the compute statement in a try/except shows that the scheduler still raises the exception, despite the docs suggesting that user exceptions should not kill the client:
try:
    compute(*values, scheduler='distributed')
except Exception as e:
    print(e)
yields the following in the client log:
Invalid image /ufrc/ewhite/b.weinstein/NeonData/SJER/DP3.30010.001/2018/FullSite/D17/2018_SJER_3/L3/Camera/Mosaic/V01/2018_SJER_3_258000_4111000_image.tif
and kills everything. I thought dask was tolerant of failed workers?
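If exceptions have to propagate, the futures interface at least lets failed tasks be skipped at gather time via `Client.gather(..., errors="skip")`. A sketch using an in-process cluster purely for illustration (my real setup is the SLURMCluster above, and `load` is a stand-in for the image-loading task):

```python
from dask.distributed import Client

def load(path):
    # Stand-in for the image-loading task: raises on a "corrupt" path.
    if "corrupt" in path:
        raise OSError("cannot identify image file %r" % path)
    return path.upper()

# In-process cluster for illustration only; not the SLURM setup.
client = Client(processes=False, n_workers=1, threads_per_worker=1)
futures = client.map(load, ["a.tif", "corrupt.tif", "b.tif"])
results = client.gather(futures, errors="skip")  # drop the failed future
client.close()
```

The failed future is simply omitted from `results`, so the computation finishes with the good tiles.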
Fault tolerance in Spark vs Dask
System Info:
Python 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 02:32:25)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dask
>>> dask.__version__
'0.20.2'
>>> import dask_jobqueue
>>> dask_jobqueue.__version__
'0.4.1'