Spark Streaming failing on YARN cluster

Date: 2015-08-13 07:11:40

Tags: apache-spark yarn spark-streaming pyspark

I have a cluster with 1 master and 2 worker nodes. I run Spark Streaming from the master and want to use all the nodes in the cluster. I have specified some parameters, such as driver memory and executor memory, in my code. When I pass --deploy-mode cluster --master yarn-cluster to spark-submit, the error shown below appears.
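The submit command is roughly of this shape (a sketch: the Kafka assembly jar and the streaming script are the paths that appear in the log below, while driver and executor memory are set inside the script through SparkConf):

spark-submit --deploy-mode cluster --master yarn-cluster \
  --jars /home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar \
  /home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py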

> log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/12 13:24:49 INFO Client: Requesting a new application from cluster with 3 NodeManagers
15/08/12 13:24:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/08/12 13:24:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/12 13:24:49 INFO Client: Setting up container launch context for our AM
15/08/12 13:24:49 INFO Client: Preparing resources for our AM container
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.5.0-cdh5.3.5.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py
15/08/12 13:24:49 INFO Client: Setting up the launch environment for our AM container
15/08/12 13:24:49 INFO SecurityManager: Changing view acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: Changing modify acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs); users with modify permissions: Set(hdfs)
15/08/12 13:24:49 INFO Client: Submitting application 3808 to ResourceManager
15/08/12 13:24:49 INFO YarnClientImpl: Submitted application application_1437639737006_3808
15/08/12 13:24:50 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:50 INFO Client: 
   client token: N/A
   diagnostics: N/A
   ApplicationMaster host: N/A
   ApplicationMaster RPC port: -1
   queue: root.hdfs
   start time: 1439385889600
   final status: UNDEFINED
   tracking URL: http://hostname:port/proxy/application_1437639737006_3808/
   user: hdfs
15/08/12 13:24:51 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:52 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:53 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:54 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:55 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:56 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:57 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:58 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:59 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:00 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:01 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:02 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:03 INFO Client: Application report for application_1437639737006_3808 (state: FAILED)
15/08/12 13:25:03 INFO Client: 
   client token: N/A
   diagnostics: Application application_1437639737006_3808 failed 2 times due to AM Container for appattempt_1437639737006_3808_000002 exited with  exitCode: -1000 due to: File file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip does not exist
.Failing this attempt.. Failing the application.
   ApplicationMaster host: N/A
   ApplicationMaster RPC port: -1
   queue: root.hdfs
   start time: 1439385889600
   final status: FAILED
   tracking URL: http://hostname:port/cluster/app/application_1437639737006_3808
   user: hdfs
Exception in thread "main" org.apache.spark.SparkException: Application application_1437639737006_3808 finished with failed status
  at org.apache.spark.deploy.yarn.Client.run(Client.scala:855)
  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881)
  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

How can I resolve this issue? If I am doing something wrong, please point it out.

4 Answers:

Answer 0 (score: 1)

The file you submitted, file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip, does not exist.
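A quick sanity check (a sketch, assuming the same paths) is to verify on the submitting host, and on the nodes where the AM attempts ran, that the file is really there and readable by the hdfs user:

ls -l /home/hdfs/spark-1.4.1/python/lib/pyspark.zip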

Answer 1 (score: 0)

When running in YARN cluster mode, you always need to specify the executor memory settings explicitly, and also provide the driver details (driver memory and so on) separately. For example:

Amazon EC2 environment (reserved instances):

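A minimal sketch of such a command, with placeholder memory/core values and a placeholder application script (tune these to your own containers):

spark-submit --master yarn-cluster --deploy-mode cluster \
  --driver-memory 2g --executor-memory 2g \
  --num-executors 2 --executor-cores 2 \
  your_streaming_app.py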

Always remember to add any third-party libraries or jars to the classpath on every task node; you can add them directly to the Spark or Hadoop classpath on each node.
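One common way to do that (a sketch using a hypothetical /opt/extra-jars directory) is to point the extra-classpath properties at that directory in conf/spark-defaults.conf on every node:

spark.driver.extraClassPath   /opt/extra-jars/*
spark.executor.extraClassPath /opt/extra-jars/*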

Notes: 1) If you are using Amazon EMR, this can be done with Custom Bootstrap Actions and S3. 2) Also remove any conflicting jars; an unexpected NullPointerException is often a symptom of exactly that.

If possible, please add the full stack trace and the container logs, so that I can answer you in a more specific way.
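One way to collect the full container logs after a failed run (assuming YARN log aggregation is enabled on the cluster) is the yarn CLI, using the application id from the report above:

yarn logs -applicationId application_1437639737006_3808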

Answer 2 (score: 0)

I ran into the same problem recently. Here is my scenario:

A Cloudera-managed CDH 5.3.3 cluster with 7 nodes. I submitted jobs from one of the nodes, and the job failed with the same problem in both YARN deploy modes.

If you look at the stack trace, you will find these lines:

15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py

That is why the job fails: the resources are never copied to the nodes.

In my case, it was resolved by correcting the HADOOP_CONF_DIR path. It was not pointing to the folder that actually contains core-site.xml, yarn-site.xml, and the other configuration files. Once I fixed this, the resources were copied during ApplicationMaster launch and the job ran correctly.
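For reference, the fix amounts to exporting the variable before running spark-submit (or setting it in spark-env.sh); /etc/hadoop/conf is the usual location on a Cloudera-managed cluster, but adjust it to wherever your core-site.xml and yarn-site.xml actually live:

export HADOOP_CONF_DIR=/etc/hadoop/conf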

Answer 3 (score: 0)

I was able to solve this problem by providing the driver memory and executor memory at submit time:

spark-submit --driver-memory 1g --executor-memory 1g --class com.package.App --master yarn --deploy-mode cluster /home/spark.jar