I am trying to run dask-on-yarn on my research group's Hadoop cluster.
I have tried each of the following calls:
dd.read_parquet('hdfs://file.parquet', engine='fastparquet')
dd.read_parquet('hdfs://file.parquet', engine='pyarrow')
dd.read_csv('hdfs://file.csv')
Each time, the following error message appears:
~/miniconda3/envs/dask/lib/python3.8/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol)
521 path = cls._strip_protocol(urlpath)
522 update_storage_options(options, storage_options)
--> 523 fs = cls(**options)
524
525 if "w" in mode:
~/miniconda3/envs/dask/lib/python3.8/site-packages/fsspec/spec.py in __call__(cls, *args, **kwargs)
52 return cls._cache[token]
53 else:
---> 54 obj = super().__call__(*args, **kwargs)
55 # Setting _fs_token here causes some static linters to complain.
56 obj._fs_token_ = token
~/miniconda3/envs/dask/lib/python3.8/site-packages/fsspec/implementations/hdfs.py in __init__(self, host, port, user, kerb_ticket, driver, extra_conf, **kwargs)
42 AbstractFileSystem.__init__(self, **kwargs)
43 self.pars = (host, port, user, kerb_ticket, driver, extra_conf)
---> 44 self.pahdfs = HadoopFileSystem(
45 host=host,
46 port=port,
~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/hdfs.py in __init__(self, host, port, user, kerb_ticket, driver, extra_conf)
38 _maybe_set_hadoop_classpath()
39
---> 40 self._connect(host, port, user, kerb_ticket, extra_conf)
41
42 def __reduce__(self):
~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/io-hdfs.pxi in pyarrow.lib.HadoopFileSystem._connect()
~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: Getting symbol hdfsNewBuilderfailed
How should I resolve this?
Here are the packages in my conda env:
# Name Version Build Channel
_libgcc_mutex 0.1 main
abseil-cpp 20200225.2 he1b5a44_0 conda-forge
arrow-cpp 0.17.1 py38h1234567_9_cpu conda-forge
attrs 19.3.0 py_0
aws-sdk-cpp 1.7.164 hc831370_1 conda-forge
backcall 0.2.0 py_0
blas 1.0 mkl
bleach 3.1.5 py_0
bokeh 2.1.1 py38_0
boost-cpp 1.72.0 h7b93d67_1 conda-forge
brotli 1.0.7 he6710b0_0
brotlipy 0.7.0 py38h7b6447c_1000
bzip2 1.0.8 h7b6447c_0
c-ares 1.15.0 h7b6447c_1001
ca-certificates 2020.6.24 0
certifi 2020.6.20 py38_0
cffi 1.14.0 py38he30daa8_1
chardet 3.0.4 py38_1003
click 7.1.2 py_0
cloudpickle 1.4.1 py_0
conda-pack 0.4.0 py_0
cryptography 2.9.2 py38h1ba5d50_0
curl 7.71.0 hbc83047_0
cytoolz 0.10.1 py38h7b6447c_0
dask 2.19.0 py_0
dask-core 2.19.0 py_0
dask-yarn 0.8.1 py38h32f6830_0 conda-forge
decorator 4.4.2 py_0
defusedxml 0.6.0 py_0
distributed 2.19.0 py38_0
entrypoints 0.3 py38_0
fastparquet 0.3.2 py38heb32a55_0
freetype 2.10.2 h5ab3b9f_0
fsspec 0.7.4 py_0
gflags 2.2.2 he6710b0_0
glog 0.4.0 he6710b0_0
grpc-cpp 1.30.0 h9ea6770_0 conda-forge
grpcio 1.27.2 py38hf8bcb03_0
heapdict 1.0.1 py_0
icu 67.1 he1b5a44_0 conda-forge
idna 2.10 py_0
importlib-metadata 1.7.0 py38_0
importlib_metadata 1.7.0 0
intel-openmp 2020.1 217
ipykernel 5.3.0 py38h5ca1d4c_0
ipython 7.16.1 py38h5ca1d4c_0
ipython_genutils 0.2.0 py38_0
jedi 0.17.1 py38_0
jinja2 2.11.2 py_0
jpeg 9b h024ee3a_2
json5 0.9.5 py_0
jsonschema 3.2.0 py38_0
jupyter_client 6.1.3 py_0
jupyter_core 4.6.3 py38_0
jupyterlab 2.1.5 py_0
jupyterlab_server 1.1.5 py_0
krb5 1.18.2 h173b8e3_0
ld_impl_linux-64 2.33.1 h53a641e_7
libcurl 7.71.0 h20c2e04_0
libedit 3.1.20191231 h7b6447c_0
libevent 2.1.10 hcdb4288_1 conda-forge
libffi 3.3 he6710b0_1
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libllvm9 9.0.1 h4a3c616_0
libpng 1.6.37 hbc83047_0
libprotobuf 3.12.3 hd408876_0
libsodium 1.0.18 h7b6447c_0
libssh2 1.9.0 h1ba5d50_1
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
llvmlite 0.33.0 py38hd408876_0
locket 0.2.0 py38_1
lz4-c 1.9.2 he6710b0_0
markupsafe 1.1.1 py38h7b6447c_0
mistune 0.8.4 py38h7b6447c_1000
mkl 2020.1 217
mkl-service 2.3.0 py38he904b0f_0
mkl_fft 1.1.0 py38h23d657b_0
mkl_random 1.1.1 py38h0573a6f_0
msgpack-python 1.0.0 py38hfd86e86_1
nbconvert 5.6.1 py38_0
nbformat 5.0.7 py_0
ncurses 6.2 he6710b0_1
notebook 6.0.3 py38_0
numba 0.50.1 py38h0573a6f_0
numpy 1.18.5 py38ha1c710e_0
numpy-base 1.18.5 py38hde5b4d6_0
olefile 0.46 py_0
openssl 1.1.1g h7b6447c_0
packaging 20.4 py_0
pandas 1.0.5 py38h0573a6f_0
pandoc 2.9.2.1 0
pandocfilters 1.4.2 py38_1
parquet-cpp 1.5.1 2 conda-forge
parso 0.7.0 py_0
partd 1.1.0 py_0
pexpect 4.8.0 py38_0
pickleshare 0.7.5 py38_1000
pillow 7.1.2 py38hb39fc2d_0
pip 20.1.1 py38_1
prometheus_client 0.8.0 py_0
prompt-toolkit 3.0.5 py_0
protobuf 3.12.3 py38he6710b0_0
psutil 5.7.0 py38h7b6447c_0
ptyprocess 0.6.0 py38_0
pyarrow 0.17.1 py38h1234567_9_cpu conda-forge
pycparser 2.20 py_0
pygments 2.6.1 py_0
pyopenssl 19.1.0 py38_0
pyparsing 2.4.7 py_0
pyrsistent 0.16.0 py38h7b6447c_0
pysocks 1.7.1 py38_0
python 3.8.3 hcff3b4d_2
python-dateutil 2.8.1 py_0
python_abi 3.8 1_cp38 conda-forge
pytz 2020.1 py_0
pyyaml 5.3.1 py38h7b6447c_1
pyzmq 19.0.1 py38he6710b0_1
re2 2020.07.01 he1b5a44_0 conda-forge
readline 8.0 h7b6447c_0
requests 2.24.0 py_0
send2trash 1.5.0 py38_0
setuptools 47.3.1 py38_0
six 1.15.0 py_0
skein 0.8.0 py38h32f6830_1 conda-forge
snappy 1.1.8 he6710b0_0
sortedcontainers 2.2.2 py_0
sqlite 3.32.3 h62c20be_0
tbb 2020.0 hfd86e86_0
tblib 1.6.0 py_0
terminado 0.8.3 py38_0
testpath 0.4.4 py_0
thrift 0.13.0 py38he6710b0_0
thrift-cpp 0.13.0 h62aa4f2_2 conda-forge
tk 8.6.10 hbc83047_0
toolz 0.10.0 py_0
tornado 6.0.4 py38h7b6447c_1
traitlets 4.3.3 py38_0
typing_extensions 3.7.4.2 py_0
urllib3 1.25.9 py_0
wcwidth 0.2.5 py_0
webencodings 0.5.1 py38_1
wheel 0.34.2 py38_0
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
zeromq 4.3.2 he6710b0_2
zict 2.0.0 py_0
zipp 3.1.0 py_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.4 h0b5b093_3
The Hadoop cluster is running Hadoop version 2.7.0-mapr-1607.
The cluster object is created as follows:
# Create a cluster where each worker has two cores and eight GiB of memory
from dask_yarn import YarnCluster

cluster = YarnCluster(
    environment='conda-env-packed-for-worker-nodes.tar.gz',
    worker_env={
        # See https://github.com/dask/dask-yarn/pull/30#issuecomment-434001858
        'ARROW_LIBHDFS_DIR': '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib',
    },
)
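For context, a minimal sketch of how the cluster is then used, assuming the standard dask.distributed workflow (the client attachment is not shown in the snippet above):

from dask.distributed import Client
import dask.dataframe as dd

# Attach a client so subsequent dask computations run on the YARN workers.
client = Client(cluster)

# The failing calls from above are then issued as usual, e.g.:
df = dd.read_csv('hdfs://file.csv')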
I suspect a version mismatch between the hadoop-0.20.2 path in the ARROW_LIBHDFS_DIR environment variable and the Hadoop CLI version, Hadoop 2.7.0.
I have to specify that directory manually so that pyarrow can find libhdfs.so at all (using the setup from https://stackoverflow.com/a/62749053/1147061). The required libhdfs.so is not provided under /opt/mapr/hadoop/hadoop-2.7.0/, and installing libhdfs3 via conda install -c conda-forge libhdfs3 does not satisfy the requirement either.
Could this be the problem?
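To make the version-mismatch suspicion concrete, one way to check is to see which of the candidate directories actually contain a libhdfs.so. A minimal sketch, using only the two paths mentioned above (adjust to the actual layout on the cluster):

import glob
import os

# The directory currently set in ARROW_LIBHDFS_DIR (hadoop-0.20.2) and the install
# matching the CLI version (hadoop-2.7.0).
candidates = [
    '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib',
    '/opt/mapr/hadoop/hadoop-2.7.0',
]
for root in candidates:
    matches = glob.glob(os.path.join(root, '**', 'libhdfs.so*'), recursive=True)
    print(root, '->', matches or 'no libhdfs.so found')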
Answer 0 (score: 0)
(Partial answer)
To use libhdfs3 (which is poorly maintained these days), you need to call
dd.read_csv('hdfs://file.csv', storage_options={'driver': 'libhdfs3'})
and, of course, have libhdfs3 installed. The hadoop library options are of no help here, since these are separate code paths.
I also suspect that getting the JNI libhdfs (without the "3") working is a matter of finding the correct .so file.
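If that is the case, the fix would be to point ARROW_LIBHDFS_DIR at whichever directory actually contains a libhdfs.so built for the running Hadoop version, both locally and in worker_env. A minimal sketch; the path below is purely illustrative:

import os

# Hypothetical location -- replace with wherever libhdfs.so was actually found.
os.environ['ARROW_LIBHDFS_DIR'] = '/path/to/dir/containing/libhdfs.so'

import dask.dataframe as dd

# ARROW_LIBHDFS_DIR must be set before pyarrow first tries to open HDFS,
# so that the JNI driver loads the matching library.
df = dd.read_csv('hdfs://file.csv')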