I am running a Dask-YARN job that dumps a results dictionary to HDFS using PyArrow's HDFS IO library (the code is shown in the traceback below). However, the job intermittently fails with the following error — not on every run, only some of the time. I have been unable to determine the root cause of this issue; does anyone have any ideas?
File "/extractor.py", line 87, in __call__
json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
Answer 0 (score: 0)
It turned out this was caused by repeatedly running `dask.get` on the delayed objects, which led to multiple processes trying to write to the same file.
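One way to guard against this kind of collision — a sketch of my own, not something from the original answer — is to make the write task idempotent by giving every execution its own target file and renaming it into place only once the write completes. The sketch below uses the local filesystem and a hypothetical `dump_results` helper; in the real job the paths would be opened through PyArrow's HDFS client instead.

```python
import json
import os
import tempfile
import uuid

def dump_results(results_dict, out_dir):
    """Write results to a uniquely named file, then rename it into place.

    If the same task is accidentally executed more than once (as with the
    repeated `dask.get` calls described above), each run targets a distinct
    file, so no two processes ever close the same path. Sketch only: uses
    the local filesystem rather than an HDFS client.
    """
    unique = f"results-{uuid.uuid4().hex}.json"  # hypothetical naming scheme
    tmp_path = os.path.join(out_dir, unique + ".tmp")
    final_path = os.path.join(out_dir, unique)
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(results_dict, f, indent=4)
    os.rename(tmp_path, final_path)  # atomic on POSIX filesystems
    return final_path

out_dir = tempfile.mkdtemp()
# Even if the task runs twice, each run writes a distinct file:
p1 = dump_results({"a": 1}, out_dir)
p2 = dump_results({"a": 1}, out_dir)
print(p1 != p2)  # True
```

The cleaner fix, of course, is to call `compute()` (or `dask.get`) exactly once per delayed write; the unique-filename pattern just makes accidental re-execution harmless.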