I am using Dask Distributed and I am trying to create a dataframe from a CSV stored in HDFS. I assume the connection to HDFS works, since I am able to print the names of the dataframe's columns. However, when I call len, or any other function, on the dataframe, I get the following error:
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/F43479/trip_data_v2.csv
I don't understand why this error occurs, and I would appreciate your input.
Here is my code:
# IMPORTS
import dask.dataframe as dd
from dask.distributed import Client
import pyarrow as pa
from pyarrow import csv
from dask import compute,config
import os
import subprocess
# GET HDFS CLASSPATH
classpath = subprocess.Popen(["/usr/hdp/current/hadoop-client/bin/hdfs", "classpath", "--glob"], stdout=subprocess.PIPE).communicate()[0]
# CONFIGURE ENVIRONMENT VARIABLES
os.environ["HADOOP_HOME"] = "/usr/hdp/current/hadoop-client"
os.environ["JAVA_HOME"] = "/home/G60070/installs/jdk1.8.0_201/"
os.environ["CLASSPATH"] = classpath.decode("utf-8")
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/2.6.5.0-292/usr/lib/"
# LAUNCH DASK DISTRIBUTED
client = Client('10.22.104.37:8786')
# SET HDFS CONNECTION
config.set(hdfs_driver='pyarrow', host='xxxxx.xxx.xx.fr', port=8020)
# READ FILE ON HDFS
folder = 'hdfs://xxxxx.xxx.xx.fr:8020/user/F43479/'
filepath = folder+'trip_data_v2.csv'
df = dd.read_csv(filepath)
# TREATMENTS ON FILE
print(df.columns)# this works
print(len(df))# produces an error
Here is the content of my HDFS directory:
[F43479@xxxxx dask_tests]$ hdfs dfs -ls /user/F43479/
Found 9 items
-rw-r----- 3 F43479 hdfs 0 2019-03-07 16:42 /user/F43479/-
drwx------ - F43479 hdfs 0 2019-04-03 02:00 /user/F43479/.Trash
drwxr-x--- - F43479 hdfs 0 2019-03-13 16:53 /user/F43479/.hiveJars
drwxr-x--- - F43479 hdfs 0 2019-03-13 16:52 /user/F43479/hive
drwxr-x--- - F43479 hdfs 0 2019-03-15 13:23 /user/F43479/nyctaxi_trip_data
-rw-r----- 3 F43479 hdfs 36 2019-04-15 11:13 /user/F43479/test.csv
-rw-r----- 3 F43479 hdfs 50486731416 2019-03-26 17:37 /user/F43479/trip_data.csv
-rw-r----- 3 F43479 hdfs 5097056230 2019-04-15 13:57 /user/F43479/trip_data_v2.csv
-rw-r----- 3 F43479 hdfs 504867312828 2019-04-02 11:15 /user/F43479/trip_data_x10.csv
And finally, the full output of the code execution:
Index(['vendor_id', 'passenger_count', 'trip_time_in_secs', 'trip_distance'], dtype='object')
Traceback (most recent call last):
  File "dask_pa_hdfs.py", line 32, in <module>
    print(len(df))
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/dataframe/core.py", line 438, in __len__
    split_every=False).compute()
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/base.py", line 397, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 2321, in get
    direct=direct)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 1655, in gather
    asynchronous=asynchronous)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 673, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/utils.py", line 277, in sync
    six.reraise(*error[0])
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/utils.py", line 262, in f
    result[0] = yield future
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/distributed/client.py", line 1500, in _gather
    traceback)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/bytes/core.py", line 133, in read_block_from_file
    with copy.copy(lazy_file) as f:
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/bytes/core.py", line 177, in __enter__
    f = SeekableFile(self.fs.open(self.path, mode=mode))
  File "/opt/anaconda3/envs/python3-dask/lib/python3.7/site-packages/dask/bytes/pyarrow.py", line 37, in open
    return self.fs.open(path, mode=mode, **kwargs)
  File "pyarrow/io-hdfs.pxi", line 431, in pyarrow.lib.HadoopFileSystem.open
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS file does not exist: /user/F43479/trip_data_v2.csv
Answer 0 (score: 0)
You have carefully set up the environment in the local process that holds the client, so that it can talk to HDFS. This is enough to find the columns, because Dask does that from the client process, using only the first few rows of the data. However:
client = Client('10.22.104.37:8786')
Your scheduler and workers live elsewhere and do not have the environment variables you set. When you run a task, the workers do not know how to find the file.
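A quick way to see this, as a sketch (nothing here is specific to your cluster), is to ask every worker what it sees in its environment; client.run executes a plain function on each worker and returns a dict keyed by worker address, so a None value means the variable is simply not set there:

import os

# Ask each worker for the variable that points pyarrow at libhdfs.
# On a setup like the one above, the client has it but the workers would return None.
client.run(lambda: os.environ.get("ARROW_LIBHDFS_DIR"))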
What you need to do is set up the environment on the workers as well. This can be done before they start, or once they are already up:
def setenv():
    import os
    os.environ["HADOOP_HOME"] = "/usr/hdp/current/hadoop-client"
    os.environ["JAVA_HOME"] = "/home/G60070/installs/jdk1.8.0_201/"
    os.environ["CLASSPATH"] = classpath.decode("utf-8")
    os.environ["ARROW_LIBHDFS_DIR"] = "/usr/hdp/2.6.5.0-292/usr/lib/"

client.run(setenv)
(this should return a None from each worker)
Note that if new workers come online dynamically, each of them will also need to run this function before accessing HDFS.
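If workers can join at any time, one possible approach (a sketch; it assumes a distributed release that provides this method, which recent versions do) is to register the same function as a worker setup callback, so it runs on all current workers and again on every worker that connects later:

# Register setenv as a setup callback: it runs now on existing workers
# and automatically on any worker that joins the cluster afterwards.
client.register_worker_callbacks(setenv)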
Answer 1 (score: 0)
I solved the problem. It was related to the permissions needed to access HDFS. I am working on a Kerberized HDFS cluster, with the Dask scheduler process started on an edge node and the worker processes started on the data nodes.
To access HDFS, pyarrow needs two things:
In addition, to access HDFS, the launched processes need to be authenticated through Kerberos. When the code was issued from the scheduler process, I was able to connect to HDFS because my session was authenticated through Kerberos. That is why I was able to get the information about the CSV file's columns.
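One way to check this, as a rough sketch using only standard Kerberos tooling (klist is not part of Dask or pyarrow), is to ask every worker whether it holds a valid ticket cache:

import subprocess

def check_kerberos_ticket():
    # klist exits with a non-zero code when there is no valid ticket cache
    result = subprocess.run(["klist"], capture_output=True, text=True)
    return result.returncode, result.stdout

# returns a dict keyed by worker address; a non-zero code means that worker is not authenticated
client.run(check_kerberos_ticket)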
However, since the worker processes were not authenticated, they could not access HDFS, which caused the error. To solve it, we had to stop the worker processes, modify the script used to start them so that it includes a Kerberos command to authenticate against HDFS (a kinit of some sort), and then restart the worker processes.
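For reference, here is a minimal sketch of that authentication step, assuming a keytab has been deployed on the worker nodes (the principal and keytab path below are placeholders, not our real values); the same command can live in the worker start script or be pushed to running workers with client.run:

import subprocess

def kinit_worker():
    # Obtain a Kerberos ticket from a keytab; replace the principal and path with your own.
    subprocess.run(
        ["kinit", "-kt", "/etc/security/keytabs/f43479.keytab", "F43479@EXAMPLE.REALM"],
        check=True,
    )

client.run(kinit_worker)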
For the moment it works, but it also means that Dask is not compatible with a Kerberized cluster as-is. With the configuration we made, every user has the same rights on HDFS when computations are launched from the workers, which I don't think is a fully secure practice.