Submitting a Spark job from Eclipse to yarn-client using Scala

Time: 2016-05-03 15:27:42

Tags: eclipse scala hadoop apache-spark yarn

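I am new to Spark and Scala, and I am having trouble submitting a Spark job in YARN client mode. Doing this from the command line with spark-submit is no problem: I create the Spark job in Eclipse, compile it into a JAR, and run spark-submit from the shell, like this: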

 spark-submit --class ebicus.WordCount /u01/stage/mvn_test-0.0.1.jar

However, running the job directly from Eclipse and submitting it to YARN from there turns out to be difficult.

My setup is as follows: the cluster runs Cloudera CDH 5.6. I have a Maven project that uses Scala, and my classpath is in sync with my pom.xml.

My code is as follows:

package test

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.TaskContext;
import akka.actor
import org.apache.spark.deploy.yarn.ClientArguments
import org.apache.spark.deploy.ClientArguments

object WordCount {

  def main(args: Array[String]): Unit = {
//    val workaround = new File(".");
    System.getProperties().put("hadoop.home.dir",  "c:\\winutil\\");
    System.setProperty("SPARK_YARN_MODE", "true");

   val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("yarn-client")
      .set("spark.hadoop.fs.defaultFS", "hdfs://namecluster.com:8020/user/username")
      .set("spark.hadoop.dfs.nameservices", "namecluster.com:8020")
      .set("spark.hadoop.yarn.resourcemanager.hostname", "namecluster.com")
      .set("spark.hadoop.yarn.resourcemanager.address", "namecluster:8032")
      .set("spark.hadoop.yarn.application.classpath",
              "/etc/hadoop/conf,"
          +"/usr/lib/hadoop/*,"
          +"/usr/lib/hadoop/lib/*,"
          +"/usr/lib/hadoop-hdfs/*,"
          +"/usr/lib/hadoop-hdfs/lib/*,"
          +"/usr/lib/hadoop-mapreduce/*,"
          +"/usr/lib/hadoop-mapreduce/lib/*,"
          +"/usr/lib/hadoop-yarn/*,"
          +"/usr/lib/hadoop-yarn/lib/*,"
          +"/usr/lib/spark/*,"
          +"/usr/lib/spark/lib/*,"
          +"/usr/lib/spark/lib/*"
      )
      .set("spark.driver.host","localhost");

    val sc = new SparkContext(conf);

    val file = sc.textFile("hdfs://namecluster.com:8020/user/root/testdir/test.csv")
    //Count number of words from a hive table (split is based on char 001)
    val counts = file.flatMap(line => line.split(1.toChar)).map(word => (word, 1)).reduceByKey(_ + _)

    //swap key and value with count value and sort from high to low 
    val test = counts.map(_.swap).sortBy(word =>(word,1), false, 5)

    test.saveAsTextFile("hdfs://namecluster.com:8020/user/root/test1")

  }

}

I get the following error message in the Hadoop ResourceManager log file:

YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>/etc/hadoop/conf<CPS>/usr/lib/hadoop/*<CPS>/usr/lib/hadoop/lib/*<CPS>/usr/lib/hadoop-hdfs/*<CPS>/usr/lib/hadoop-hdfs/lib/*<CPS>/usr/lib/hadoop-mapreduce/*<CPS>/usr/lib/hadoop-mapreduce/lib/*<CPS>/usr/lib/hadoop-yarn/*<CPS>/usr/lib/hadoop-yarn/lib/*<CPS>/usr/lib/spark/*<CPS>/usr/lib/spark/lib/*<CPS>/usr/lib/spark/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$MR2_CLASSPATH
    SPARK_LOG_URL_STDERR -> http://cloudera-002.fusion.ebicus.com:8042/node/containerlogs/container_1461679867178_0026_01_000005/hadriaans/stderr?start=-4096
    SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1461679867178_0026
    SPARK_YARN_CACHE_FILES_FILE_SIZES -> 520473
    SPARK_USER -> hadriaans
    SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE
    SPARK_YARN_MODE -> true
    SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1462288779267
    SPARK_LOG_URL_STDOUT -> http://cloudera-002.fusion.ebicus.com:8042/node/containerlogs/container_1461679867178_0026_01_000005/hadriaans/stdout?start=-4096
    SPARK_YARN_CACHE_FILES -> hdfs://cloudera-003.fusion.ebicus.com:8020/user/hadriaans/.sparkStaging/application_1461679867178_0026/spark-yarn_2.10-1.5.0.jar#__spark__.jar

  command:
    {{JAVA_HOME}}/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms1024m -Xmx1024m -Djava.io.tmpdir={{PWD}}/tmp '-Dspark.driver.port=49961' -Dspark.yarn.app.container.log.dir=<LOG_DIR> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.29.51.113:49961/user/CoarseGrainedScheduler --executor-id 4 --hostname cloudera-002.fusion.ebicus.com --cores 1 --app-id application_1461679867178_0026 --user-class-path file:$PWD/__app__.jar 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
===============================================================================

16/05/03 17:19:58 INFO impl.ContainerManagementProtocolProxy: Opening proxy : cloudera-002.fusion.ebicus.com:8041
16/05/03 17:20:01 INFO yarn.YarnAllocator: Completed container container_1461679867178_0026_01_000005 (state: COMPLETE, exit status: 1)
16/05/03 17:20:01 INFO yarn.YarnAllocator: Container marked as failed: container_1461679867178_0026_01_000005. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1461679867178_0026_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
    at org.apache.hadoop.util.Shell.run(Shell.java:478)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Any hints or suggestions are welcome.

1 answer:

Answer 0: (score: 2)

From my previous experience, there are two likely causes of this non-descriptive error (I also submit jobs from Eclipse, but using Java):

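  1. I noticed that you do not pass the JAR to the SparkContext configuration. If I remove the line that passes the JAR when submitting from Eclipse, my code fails with exactly the same error. So, basically, you set the path of a JAR that does not exist yet in the code, then export the project as a Runnable JAR, which packages all transitive dependencies into it, to the path you set in the code. This is how it looks in Java:

    SparkConf sparkConfiguration = new SparkConf();
    sparkConfiguration.setJars(new String[] {"path to the jar"});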

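  2. Check that your cluster is healthy; your tmp directory may be full. Go through all the Hadoop log files: some of them (sorry, I do not remember which) give more detail, in the form of warnings, when this happens.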
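
Since the question uses Scala, point 1 could be sketched in Scala roughly as below. This is a minimal sketch, not a verified fix for the error above: it assumes spark-core is on the project classpath (as in the question's Maven setup), and the JAR path `/path/to/exported/app.jar` and the object name `JarConfSketch` are hypothetical placeholders. The JAR must be exported to that path before the job is run from Eclipse.

```scala
import org.apache.spark.SparkConf

object JarConfSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical path: the location the project is exported to as a runnable JAR.
    val appJar = "/path/to/exported/app.jar"

    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("yarn-client")
      // Register the application JAR so it is distributed to the executors.
      .setJars(Seq(appJar))

    // setJars stores the list under the "spark.jars" property.
    println(conf.get("spark.jars"))

    // The configuration would then be passed on as in the question:
    // val sc = new SparkContext(conf)
  }
}
```

The remaining settings from the question (`spark.hadoop.*`, `spark.driver.host`) would be chained onto the same `SparkConf` unchanged.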