Spark Streaming failing on YARN cluster

Date: 2015-08-13 07:11:40

Tags: apache-spark yarn spark-streaming pyspark

I have a cluster with 1 master and 2 worker nodes. I run Spark Streaming from the master and want to use all the nodes in the cluster. I have specified some parameters, such as driver memory and executor memory, in my code. When I pass --deploy-mode cluster --master yarn-cluster to spark-submit, the error shown below appears.
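The submit command is roughly of this shape (a sketch: the Kafka assembly jar and the streaming script are the paths that appear in the log below, while driver and executor memory are set inside the script through SparkConf):

spark-submit --deploy-mode cluster --master yarn-cluster \
  --jars /home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar \
  /home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py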

> log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/12 13:24:49 INFO Client: Requesting a new application from cluster with 3 NodeManagers
15/08/12 13:24:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/08/12 13:24:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/08/12 13:24:49 INFO Client: Setting up container launch context for our AM
15/08/12 13:24:49 INFO Client: Preparing resources for our AM container
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/assembly/target/scala-2.10/spark-assembly-1.4.1-hadoop2.5.0-cdh5.3.5.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py
15/08/12 13:24:49 INFO Client: Setting up the launch environment for our AM container
15/08/12 13:24:49 INFO SecurityManager: Changing view acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: Changing modify acls to: hdfs
15/08/12 13:24:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs); users with modify permissions: Set(hdfs)
15/08/12 13:24:49 INFO Client: Submitting application 3808 to ResourceManager
15/08/12 13:24:49 INFO YarnClientImpl: Submitted application application_1437639737006_3808
15/08/12 13:24:50 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:50 INFO Client: 
   client token: N/A
   diagnostics: N/A
   ApplicationMaster host: N/A
   ApplicationMaster RPC port: -1
   queue: root.hdfs
   start time: 1439385889600
   final status: UNDEFINED
   tracking URL: http://hostname:port/proxy/application_1437639737006_3808/
   user: hdfs
15/08/12 13:24:51 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:52 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:53 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:54 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:55 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:56 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:57 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:58 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:24:59 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:00 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:01 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:02 INFO Client: Application report for application_1437639737006_3808 (state: ACCEPTED)
15/08/12 13:25:03 INFO Client: Application report for application_1437639737006_3808 (state: FAILED)
15/08/12 13:25:03 INFO Client: 
   client token: N/A
   diagnostics: Application application_1437639737006_3808 failed 2 times due to AM Container for appattempt_1437639737006_3808_000002 exited with  exitCode: -1000 due to: File file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip does not exist
.Failing this attempt.. Failing the application.
   ApplicationMaster host: N/A
   ApplicationMaster RPC port: -1
   queue: root.hdfs
   start time: 1439385889600
   final status: FAILED
   tracking URL: http://hostname:port/cluster/app/application_1437639737006_3808
   user: hdfs
Exception in thread "main" org.apache.spark.SparkException: Application application_1437639737006_3808 finished with failed status
  at org.apache.spark.deploy.yarn.Client.run(Client.scala:855)
  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:881)
  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

How can I resolve this issue? If I am doing something wrong, please point it out.

4 Answers:

Answer 0 (score: 1)

The file you submitted, file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip, does not exist.
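A quick sanity check (a sketch, assuming the same paths) is to verify on the submitting host, and on the nodes where the AM attempts ran, that the file is really there and readable by the hdfs user:

ls -l /home/hdfs/spark-1.4.1/python/lib/pyspark.zip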

Answer 1 (score: 0)

When running in YARN cluster mode, you always need to specify the executor memory settings explicitly, and also provide the driver details (driver memory and so on) separately. For example:

Amazon EC2 environment (reserved instances):

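A minimal sketch of such a command, with placeholder memory/core values and a placeholder application script (tune these to your own containers):

spark-submit --master yarn-cluster --deploy-mode cluster \
  --driver-memory 2g --executor-memory 2g \
  --num-executors 2 --executor-cores 2 \
  your_streaming_app.py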

Always remember to add any third-party libraries or jars to the classpath on every task node; you can add them directly to the Spark or Hadoop classpath on each node.
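One common way to do that (a sketch using a hypothetical /opt/extra-jars directory) is to point the extra-classpath properties at that directory in conf/spark-defaults.conf on every node:

spark.driver.extraClassPath   /opt/extra-jars/*
spark.executor.extraClassPath /opt/extra-jars/*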

Notes: 1) If you are using Amazon EMR, this can be done with Custom Bootstrap Actions and S3. 2) Also remove any conflicting jars; an unexpected NullPointerException is often a symptom of exactly that.

If possible, please add the full stack trace and the container logs, so that I can answer you in a more specific way.
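One way to collect the full container logs after a failed run (assuming YARN log aggregation is enabled on the cluster) is the yarn CLI, using the application id from the report above:

yarn logs -applicationId application_1437639737006_3808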

Answer 2 (score: 0)

I ran into the same problem recently. Here is my scenario:

A Cloudera-managed CDH 5.3.3 cluster with 7 nodes. I submitted jobs from one of the nodes, and the job failed with the same problem in both YARN deploy modes.

If you look at the stack trace, you will find these lines:

15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1.jar
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/pyspark.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip
15/08/12 13:24:49 INFO Client: Source and destination file systems are the same. Not copying file:/home/hdfs/spark-1.4.1/examples/src/main/python/streaming/kyt.py

That is why the job fails: the resources are never copied to the nodes.

In my case, it was resolved by correcting the HADOOP_CONF_DIR path. It was not pointing to the folder that actually contains core-site.xml, yarn-site.xml, and the other configuration files. Once I fixed this, the resources were copied during ApplicationMaster launch and the job ran correctly.
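For reference, the fix amounts to exporting the variable before running spark-submit (or setting it in spark-env.sh); /etc/hadoop/conf is the usual location on a Cloudera-managed cluster, but adjust it to wherever your core-site.xml and yarn-site.xml actually live:

export HADOOP_CONF_DIR=/etc/hadoop/conf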

Answer 3 (score: 0)

I was able to solve this problem by providing the driver memory and executor memory at submit time:

spark-submit --driver-memory 1g --executor-memory 1g --class com.package.App --master yarn --deploy-mode cluster /home/spark.jar