I'm new to Spark and Hadoop. I'm trying to set up an EC2 cluster with Spark 2.0.
I copied the file onto the cluster's ephemeral HDFS and can see that it's there:
[root@ip-172-31-58-53 bin]$ ./hadoop fs -ls /root/
Warning: $HADOOP_HOME is deprecated.
Found 2 items
drwxr-xr-x   - root supergroup          0 2017-05-23 12:08 /root/_distcp_logs_sls6bc
-rw-r--r--   3 root supergroup  543046714 2017-05-23 12:08 /root/input.csv
Here is the Python code I'm submitting:
import sys
import numpy as np
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("MatrixMult")\
        .getOrCreate()

    df = spark.read.option("header","true").csv("hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000/root/input.csv")
    df.show(10)
    spark.stop()
My Hadoop core-site.xml contains the following:
<property>
  <name>fs.default.name</name>
  <value>hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000</value>
</property>
Here is the error I get when I submit the job:
Traceback (most recent call last):
File "/root/python_code/matrix_mult.py", line 12, in <module>
df = spark.read.option("header","true").csv("hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000/root/input.csv")
File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 380, in csv
File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.csv.
: java.io.IOException: Failed on local exception: java.io.IOException: Broken pipe; Host Details : local host is: "ip-172-31-58-53.ec2.internal/172.31.58.53"; destination host is: "ec2-54-144-193-191.compute-1.amazonaws.com":9000;
...
Any idea why this is happening? Any tips on how to debug it? I've tried using the internal hostname as well, but that doesn't work either. Thanks in advance.
Answer 0 (score: 0)
I think you only need to set either fs.defaultFS or fs.default.name. My core-site.xml is configured as:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:8020</value>
  </property>
</configuration>
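Once fs.defaultFS is set in core-site.xml and the Hadoop configuration is visible to Spark (for example via HADOOP_CONF_DIR), the application arguably does not need to hardcode the namenode host and port at all. A minimal sketch, assuming the same /root/input.csv path from the question:

import sys
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("MatrixMult").getOrCreate()

    # With fs.defaultFS configured, an hdfs:/// path (no host:port) resolves
    # against the default filesystem from core-site.xml, so the namenode
    # address is not duplicated in application code.
    df = spark.read.option("header", "true").csv("hdfs:///root/input.csv")
    df.show(10)

    spark.stop()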
Answer 1 (score: 0)
The reason turned out to be silly. I was using the prebuilt binaries downloaded from Apache, which expect you to have Hadoop 2. When you run the EC2 scripts, you have to pass the flag --hadoop-major-version=2, and I hadn't done that.
I rebuilt the cluster with that flag and it fixed the problem.
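If you're unsure whether your Spark binaries and the cluster's HDFS agree on the Hadoop major version, one possible check (a debugging sketch using the Py4J JVM gateway that PySpark exposes; not part of the original answer) is to print the Hadoop client version bundled with Spark:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()

    # Version of the Hadoop client libraries bundled with these Spark
    # binaries, obtained through the Py4J JVM gateway. If this prints 2.x
    # but the EC2 cluster was launched with Hadoop 1 HDFS, reads can fail
    # with low-level RPC errors such as the "Broken pipe" in the question.
    print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())

    spark.stop()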