I am trying to import my .csv file of 3,000 observations and 77 features as an H2O frame (within a Spark session):
(First approach)
# Convert pandas dataframe to H2O dataframe
import h2o
h2o.init()
data_train = h2o.import_file('/u/users/vn505f6/data.csv')
However, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 102, in __init__
column_names, column_types, na_strings)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 143, in _upload_python_object
self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 319, in _upload_parse
self._parse(rawkey, destination_frame, header, sep, column_names, column_types, na_strings)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 326, in _parse
return self._parse_raw(setup)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 355, in _parse_raw
self._ex._cache.fill()
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/expr.py", line 346, in fill
res = h2o.api("GET " + endpoint % self._id, data=req_params)["frames"][0]
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 103, in api
return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 402, in request
return self._process_response(resp, save_to)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/backend/connection.py", line 725, in _process_response
raise H2OResponseError(data)
h2o.exceptions.H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
Error: Unknown parameter: full_column_count
Request: GET /3/Frames/Key_Frame__upload_84df978b98892632a7ce19303c4440f3.hex
params: {u'row_offset': '0', u'row_count': '10', u'full_column_count': '-1', u'column_count': '-1', u'column_offset': '0'}
Note that when I do this on my local machine I get no error; the error above appears only when I do the same thing on the Spark/Hadoop cluster.
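Since the failure is cluster-only, one thing worth checking is whether the h2o Python client and the H2O backend it talks to are on the same version in both environments: the server itself rejects full_column_count as an unknown parameter, which is the kind of symptom a client/server version mismatch would produce. A minimal check, assuming an active connection and that h2o.cluster() is available (it is in recent h2o-py releases):
import h2o
# Version of the h2o Python client package
print("client: %s" % h2o.__version__)
# Version reported by the running H2O backend
print("server: %s" % h2o.cluster().version)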
Alternatively, I tried the following within the Spark cluster:
(Second approach)
from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o
# Attach H2O to the running Spark session via Sparkling Water
h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)
data_train = h2o.import_file('/u/users/vn505f6/data.csv')
Then I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 414, in import_file
return H2OFrame()._import_parse(path, pattern, destination_frame, header, sep, col_names, col_types, na_strings)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/frame.py", line 311, in _import_parse
rawkey = h2o.lazy_import(path, pattern)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 282, in lazy_import
return _import(path, pattern)
File "/u/users/svcssae/pyenv/prod_python_libs/lib/python2.7/site-packages/h2o/h2o.py", line 291, in _import
if j["fails"]: raise ValueError("ImportFiles of " + path + " failed on " + str(j["fails"]))
ValueError: ImportFiles of /u/users/vn505f6/data.csv failed on [u'/u/users/vn505f6/data.csv']
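For what it is worth, h2o.import_file is a server-side read: it asks the H2O nodes themselves to open the path, so on a YARN cluster the file has to be reachable from those nodes (for example on HDFS) rather than only from the driver's local filesystem. A minimal sketch assuming the file has first been copied to HDFS (the hdfs:// path below is illustrative, not taken from the post):
import h2o
# Illustrative HDFS location; the file must be copied there first, e.g.
#   hdfs dfs -put /u/users/vn505f6/data.csv /user/vn505f6/data.csv
# import_file requires every H2O node to be able to reach the path
data_train = h2o.import_file('hdfs:///user/vn505f6/data.csv')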
The column names of the pandas dataframe are strings like the following: u_cnt_days_with_sale_14day.
What is this error, and how can I fix it?
PS
Below is the command-line invocation that creates the Spark cluster/context:
SPARK_HOME=/u/users/******/spark-2.3.0 \
Q_CORE_LOC=/u/users/******/q-core \
ENV=local \
HIVE_HOME=/usr/hdp/current/hive-client \
SPARK2_HOME=/u/users/******/spark-2.3.0 \
HADOOP_CONF_DIR=/etc/hadoop/conf \
HIVE_CONF_DIR=/etc/hive/conf \
HDFS_PREFIX=hdfs:// \
PYTHONPATH=/u/users/******/q-core/python-lib:/u/users/******/three-queues/python-lib:/u/users/******/pyenv/prod_python_libs/lib/python2.7/site-packages/:$PYTHON_PATH \
YARN_HOME=/usr/hdp/current/hadoop-yarn-client \
SPARK_DIST_CLASSPATH=$(hadoop classpath):$(yarn classpath):/etc/hive/conf/hive-site.xml \
PYSPARK_PYTHON=/usr/bin/python2.7 \
QQQ_LOC=/u/users/******/three-queues \
/u/users/******/spark-2.3.0/bin/pyspark \
--master yarn \
--executor-memory 10g \
--num-executors 128 \
--executor-cores 10 \
--conf spark.port.maxRetries=80 \
--conf spark.dynamicAllocation.enabled=False \
--conf spark.default.parallelism=6000 \
--conf spark.sql.shuffle.partitions=6000 \
--principal ************************ \
--queue default \
--name interactive_H2O_MT \
--keytab /u/users/******/.******.keytab \
--driver-memory 10g
Answer 0 (score: 1)
In the end, what I did was to first import the .csv file as a pandas dataframe and then convert it to an H2O frame:
from pysparkling import H2OContext
from ssat_utils.spark import SparkUtilities
import h2o
import pandas as pd
h2o_context = H2OContext.getOrCreate(SparkUtilities.spark)
# Read the .csv into pandas on the driver, then hand the dataframe over to H2O
data_train = pd.read_csv('/u/users/vn505f6/data.csv')
data_train = h2o.H2OFrame(data_train)
I honestly do not know why this works, while importing the .csv file directly as an H2O frame in the two different ways above does not.
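A plausible explanation (an assumption on my part, not confirmed anywhere in the post): pd.read_csv runs inside the Python driver process, and h2o.H2OFrame(...) then uploads that data to the H2O backend over the client connection, so the H2O nodes never need to see the local path themselves. If that is right, h2o.upload_file should behave the same way while skipping the pandas round-trip; a minimal sketch under that assumption:
import h2o
# upload_file streams the local file from the Python client process to the
# H2O backend, unlike import_file, which makes the H2O nodes read the path
data_train = h2o.upload_file('/u/users/vn505f6/data.csv')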