Unable to access files in HDFS through PySpark

Date: 2017-05-23 14:51:25

Tags: hadoop apache-spark amazon-ec2 pyspark hdfs

I'm new to Spark and Hadoop. I'm trying to set up an EC2 cluster with Spark 2.0.

I copied the file over to the ephemeral HDFS, and I can see that it's there using hadoop fs -ls:

[root@ip-172-31-58-53 bin]$ ./hadoop fs -ls /root/
Warning: $HADOOP_HOME is deprecated.

Found 2 items
drwxr-xr-x   - root supergroup          0 2017-05-23 12:08 /root/_distcp_logs_sls6bc
-rw-r--r--   3 root supergroup  543046714 2017-05-23 12:08 /root/input.csv

Here is the Python code I'm submitting:

import sys

import numpy as np
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("MatrixMult")\
        .getOrCreate()

    df = spark.read.option("header","true").csv("hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000/root/input.csv")

    df.show(10)

    # Note: PySpark's SparkSession has no close(); stop() ends the session.
    spark.stop()

My Hadoop core-site.xml contains the following:

<property>
  <name>fs.default.name</name>
  <value>hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000</value>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000</value>
</property>

Here is the error I get when I submit the job:

Traceback (most recent call last):
  File "/root/python_code/matrix_mult.py", line 12, in <module>
    df = spark.read.option("header","true").csv("hdfs://ec2-54-144-193-191.compute-1.amazonaws.com:9000/root/input.csv")
  File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 380, in csv
  File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/root/spark-2.1.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.csv.
: java.io.IOException: Failed on local exception: java.io.IOException: Broken pipe; Host Details : local host is: "ip-172-31-58-53.ec2.internal/172.31.58.53"; destination host is: "ec2-54-144-193-191.compute-1.amazonaws.com":9000; 
...

Any idea why this happens? Any tips on how to debug it? I've also tried using the internal hostname, but that doesn't work either. Thanks in advance.
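One way to narrow this down is to ask the running job which filesystem it actually resolves and whether it can reach the file at all. Below is a rough sketch using the py4j bridge; the `_jsc` and `_jvm` handles are internal implementation details, and a failure here should reproduce the same connection error as the traceback above:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("HdfsCheck").getOrCreate()

    # Hadoop configuration as seen by the driver: shows which NameNode
    # address Spark will use for paths without an explicit host.
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    print(hconf.get("fs.defaultFS"))

    # Try to reach the file through Hadoop's FileSystem API directly.
    jvm = spark.sparkContext._jvm
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(hconf)
    print(fs.exists(jvm.org.apache.hadoop.fs.Path("/root/input.csv")))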

2 Answers:

Answer 0 (score: 0)

I think you only need to set fs.defaultFS (or fs.default.name). My core-site.xml is configured as:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:8020</value>
    </property>
</configuration>
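With fs.defaultFS set like this, the read in the question can drop the hard-coded host and port and let Spark resolve the NameNode from the configuration. A minimal sketch, assuming core-site.xml is on the driver's classpath and the file sits at /root/input.csv as shown in the question:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("MatrixMult").getOrCreate()

    # The path is resolved against fs.defaultFS from core-site.xml,
    # so no NameNode host:port needs to be hard-coded in the URI.
    df = spark.read.option("header", "true").csv("/root/input.csv")
    df.show(10)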

Answer 1 (score: 0)

The reason was silly. I was using the prebuilt binaries downloaded from Apache, which expect you to have Hadoop 2. When you run the EC2 scripts, you have to pass the flag --hadoop-major-version=2. I hadn't done that.

I rebuilt the cluster with that flag, and it fixed the problem.