What is the correct way to switch from ephemeral to persistent HDFS in a new Spark 2.0.0 cluster (spun up with the ec2 scripts)?
Here is what I am doing:
/root/ephemeral-hdfs/sbin/stop-dfs.sh
/root/persistent-hdfs/sbin/start-dfs.sh
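To check that the switch actually took effect on the master, something like the following can list the running JVMs and look for the HDFS daemons. This is a minimal sketch: it assumes jps (shipped with the JDK) is on the PATH and that the daemons use the stock Hadoop names, neither of which I have verified on the spark-ec2 AMI.
    import subprocess

    # List the JVMs running on the master and look for the HDFS daemons.
    # Assumes `jps` is on the PATH -- an assumption, not something the
    # spark-ec2 AMI guarantees.
    jvms = subprocess.check_output(["jps"]).decode()

    for daemon in ("NameNode", "SecondaryNameNode", "DataNode"):
        print("%s: %s" % (daemon, "running" if daemon in jvms else "not found"))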
Then in pyspark, when I try to load a simple JSON object from S3, I get this error:
>>> df = spark.read.json(s3path)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/sql/readwriter.py", line 220, in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/root/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/root/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/root/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.json.
: java.lang.RuntimeException: java.net.ConnectException: Call From ip-172-31-44-104.ec2.internal/172.31.44.104 to ec2-54-162-71-31.compute-1.amazonaws.com:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
The error shows the JSON call failing because the connection to the namenode on port 9000 is refused. The link suggested in the error says that "one common cause of this is that the Hadoop service isn't running."
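To confirm that reading, something like this could check from the driver whether anything is listening on the namenode ports. It is only a sketch: the address comes from the traceback, and using 9010 as the persistent-hdfs namenode port is my assumption, not a verified default; core-site.xml is the authoritative source.
    import socket

    # Master's private address from the traceback, plus the two namenode ports
    # (9000 = ephemeral, 9010 = persistent is an assumption, not verified).
    master = "172.31.44.104"
    for port in (9000, 9010):
        s = socket.socket()
        s.settimeout(2)
        try:
            s.connect((master, port))
            print("port %d: something is listening" % port)
        except socket.error as exc:
            print("port %d: %s" % (port, exc))
        finally:
            s.close()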
So I commented out / added the following lines in spark/conf/spark-defaults.conf:
#spark.executor.extraLibraryPath /root/ephemeral-hdfs/lib/native/
#spark.executor.extraClassPath /root/ephemeral-hdfs/conf
spark.executor.extraLibraryPath /root/persistent-hdfs/lib/native/
spark.executor.extraClassPath /root/persistent-hdfs/conf
The error persists.
I also tried sh spark/sbin/stop-all.sh and sh spark/sbin/start-all.sh, with the same result.
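Since spark.executor.extraClassPath controls which core-site.xml the executors pick up, it may also be worth comparing the default filesystem declared by each conf directory. A minimal sketch follows; the paths are the spark-ec2 defaults on the master, and the property may be named fs.default.name or fs.defaultFS depending on the Hadoop version bundled.
    import xml.etree.ElementTree as ET

    # Print the default filesystem declared by each conf directory so the two
    # can be compared; the paths are the spark-ec2 defaults on the master.
    for conf in ("/root/ephemeral-hdfs/conf/core-site.xml",
                 "/root/persistent-hdfs/conf/core-site.xml"):
        root = ET.parse(conf).getroot()
        for prop in root.findall("property"):
            name = prop.findtext("name")
            if name in ("fs.default.name", "fs.defaultFS"):
                print("%s: %s = %s" % (conf, name, prop.findtext("value")))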
If I stop persistent-hdfs and start ephemeral-hdfs again (and undo the changes to the conf file), everything works, but the writes go to ephemeral HDFS.
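To illustrate that last point: any path that is not fully qualified resolves against the default filesystem of whichever conf directory is on Spark's classpath, which is why the data silently lands in ephemeral HDFS. The snippet below is hypothetical (df is the DataFrame read above, the output path is made up, and the 9010 port for persistent-hdfs is my assumption).
    # With the ephemeral-hdfs conf on Spark's classpath, an unqualified path
    # like "/tmp/out" resolves against fs.default.name, i.e. ephemeral HDFS.
    df.write.json("/tmp/out")

    # Fully qualifying the URI pins the target filesystem explicitly; the
    # hostname is the master from the traceback and the port (9010) is an
    # assumed persistent-hdfs namenode port, not a verified value.
    df.write.json("hdfs://ec2-54-162-71-31.compute-1.amazonaws.com:9010/tmp/out")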
What is the correct way to do this?