Setting up persistent-hdfs in Spark 2.0.0

Date: 2016-09-14 18:58:50

Tags: apache-spark hdfs pyspark persistent-data

What is the correct way to switch from ephemeral to persistent HDFS in a new Spark 2.0.0 cluster (spun up with the ec2 scripts)?

Here is what I am doing:

/root/ephemeral-hdfs/sbin/stop-dfs.sh
/root/persistent-hdfs/sbin/start-dfs.sh

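To confirm the persistent-hdfs daemons actually came up before pointing Spark at them, a quick sanity check might look like the following (a sketch, assuming the standard spark-ec2 layout with the conf directory under /root/persistent-hdfs/conf and a Hadoop-style core-site.xml in it):

# verify the NameNode/DataNode JVMs are running on this node
jps | grep -E 'NameNode|DataNode'

# check which filesystem URI (host:port) persistent-hdfs is configured to serve
grep -A1 'fs.default' /root/persistent-hdfs/conf/core-site.xml
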
Then in pyspark, when I try to load a simple json object from S3, I get this error:

>>> df = spark.read.json(s3path)
Traceback (most recent call last):                                              
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/sql/readwriter.py", line 220, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/root/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/root/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/root/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.json.
: java.lang.RuntimeException: java.net.ConnectException: Call From ip-172-31-44-104.ec2.internal/172.31.44.104 to ec2-54-162-71-31.compute-1.amazonaws.com:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

The traceback boils down to a Java ConnectException raised while calling the json reader. The link suggested in the error states that "a common cause for this is that the Hadoop service isn't running."
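
Since the traceback reports a connection refused to ec2-54-162-71-31.compute-1.amazonaws.com:9000, one way to check whether anything is actually listening on that NameNode port is shown below (a sketch; the host and port are taken directly from the error message):

# is any local process listening on port 9000 on the master?
netstat -tlnp | grep 9000

# probe the NameNode RPC port from the driver node (if nc is installed)
nc -zv ec2-54-162-71-31.compute-1.amazonaws.com 9000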

So I commented out / added the following lines in spark/conf/spark-defaults.conf:

#spark.executor.extraLibraryPath  /root/ephemeral-hdfs/lib/native/
#spark.executor.extraClassPath  /root/ephemeral-hdfs/conf
spark.executor.extraLibraryPath /root/persistent-hdfs/lib/native/
spark.executor.extraClassPath /root/persistent-hdfs/conf
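
These two properties only control which native libraries and which Hadoop conf directory the executors pick up; the default filesystem URI itself still comes from core-site.xml. Since on spark-ec2 clusters the ephemeral and persistent HDFS installs typically listen on different NameNode ports, it may be worth comparing the two conf directories (a sketch, assuming the /root/*-hdfs/conf paths already used above):

# compare the default filesystem URI each HDFS install is configured with
grep -A1 'fs.default' /root/ephemeral-hdfs/conf/core-site.xml
grep -A1 'fs.default' /root/persistent-hdfs/conf/core-site.xml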

The error persists.

I also tried sh spark/sbin/stop-all.sh followed by sh spark/sbin/start-all.sh, still with the same result.

If I stop persistent-hdfs and start ephemeral-hdfs again (and undo the changes to the conf file), everything works fine, but the writes go to ephemeral-hdfs.

What is the right way to do this?

0 Answers:

No answers