How do I execute an application uploaded to worker nodes with the --files option?

Asked: 2017-10-22 16:25:21

Tags: scala hadoop apache-spark yarn

I upload a file to my worker nodes using spark-submit, and I want to access this file. The file is a binary that I want to execute. I already know how to execute a file from Scala, but I keep getting a "file not found" exception and cannot find a way to access it.

I submit my job with the following command:

spark-submit --class Main --master yarn --deploy-mode cluster --files las2las myjar.jar

When the job runs, I can see that the file is uploaded to the staging directory of the running application, but when I try to run the following, it does not work:

val command = "hdfs://url/user/username/.sparkStaging/" + sparkContext.applicationId + "/las2las" !!

Here is the exception that is thrown:

17/10/22 18:15:57 ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Cannot run program "hdfs://url/user/username/.sparkStaging/application_1486393309284_26788/las2las": error=2, No such file or directory

So my question is: how can I access the las2las file?

2 answers:

Answer 0 (score: 1)

Use SparkFiles:

 val path = org.apache.spark.SparkFiles.get("las2las")
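For example, a minimal sketch of running the binary this way (an assumption: the file was shipped with --files las2las as in the question; SparkFiles.get resolves the localized copy on executors, and on the driver in cluster mode, and the setExecutable call is there because localized files are not necessarily marked executable):

import scala.sys.process._
import org.apache.spark.SparkFiles

// Resolve the local copy of the file shipped with --files las2las.
val las2las = SparkFiles.get("las2las")

// The localized file may not carry the executable bit, so set it first.
new java.io.File(las2las).setExecutable(true)

// Run the binary and capture its stdout.
val output = Seq(las2las).!!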

Answer 1 (score: 1)

How can I access the las2las file?

When you go to the YARN UI at http://localhost:8088/cluster and click the application ID of your Spark application, you are redirected to a page with the container logs. Click Logs. In stderr you should find lines similar to the following:

===============================================================================
YARN executor launch context:
  env:
    CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>{{PWD}}/__spark_conf__/__hadoop_conf__
    SPARK_YARN_STAGING_DIR -> file:/Users/jacek/.sparkStaging/application_1508700955259_0002
    SPARK_USER -> jacek
    SPARK_YARN_MODE -> true

  command:
    {{JAVA_HOME}}/bin/java \ 
      -server \ 
      -Xmx1024m \ 
      -Djava.io.tmpdir={{PWD}}/tmp \ 
      '-Dspark.worker.ui.port=44444' \ 
      '-Dspark.driver.port=55365' \ 
      -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
      -XX:OnOutOfMemoryError='kill %p' \ 
      org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
      --driver-url \ 
      spark://CoarseGrainedScheduler@192.168.1.6:55365 \ 
      --executor-id \ 
      <executorId> \ 
      --hostname \ 
      <hostname> \ 
      --cores \ 
      1 \ 
      --app-id \ 
      application_1508700955259_0002 \ 
      --user-class-path \ 
      file:$PWD/__app__.jar \ 
      1><LOG_DIR>/stdout \ 
      2><LOG_DIR>/stderr

  resources:
    __spark_libs__ -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/__spark_libs__618005180363157241.zip" } size: 218111116 timestamp: 1508701349000 type: ARCHIVE visibility: PRIVATE
    __spark_conf__ -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/__spark_conf__.zip" } size: 105328 timestamp: 1508701349000 type: ARCHIVE visibility: PRIVATE
    hello.sh -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh" } size: 33 timestamp: 1508701349000 type: FILE visibility: PRIVATE
===============================================================================

I executed my Spark application as follows:

YARN_CONF_DIR=/tmp \
./bin/spark-shell --master yarn --deploy-mode client --files hello.sh

The line of interest is:

hello.sh -> resource { scheme: "file" port: -1 file: "/Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh" } size: 33 timestamp: 1508701349000 type: FILE visibility: PRIVATE

You should find a similar line with the path to your shell script (mine is /Users/jacek/.sparkStaging/application_1508700955259_0002/hello.sh).

The file is a binary that I want to execute.

Using that path, you can try to execute it:

import scala.sys.process._
scala> s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh" !!
warning: there was one feature warning; re-run with -feature for details
java.io.IOException: Cannot run program "/Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh": error=13, Permission denied
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
  at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:69)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:113)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:129)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
  ... 50 elided
Caused by: java.io.IOException: error=13, Permission denied
  at java.lang.UNIXProcess.forkAndExec(Native Method)
  at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
  at java.lang.ProcessImpl.start(ProcessImpl.java:134)
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
  ... 54 more

This does not work by default because the file is not marked as executable:

$ ls -l /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
-rw-r--r--  1 jacek  staff  33 22 paź 21:51 /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh

(I do not know whether you can tell Spark or YARN to make the file executable.)
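From the application itself, though, you can set the bit with the standard java.io.File API instead of shelling out, for instance:

// Set the executable bit without spawning a chmod process.
new java.io.File(s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh").setExecutable(true)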

Here, let's make the file executable with chmod:

scala> s"chmod +x /Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh".!!
res2: String = ""

It is now indeed an executable shell script:

$ ls -l /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh
-rwxr-xr-x  1 jacek  staff  33 22 paź 21:51 /Users/jacek/.sparkStaging/application_1508700955259_0003/hello.sh

Let's execute it:

scala> s"/Users/jacek/.sparkStaging/${sc.applicationId}/hello.sh".!!
+ echo 'Hello world'
res3: String =
"Hello world
"

It works fine, given the following hello.sh:
#!/bin/sh -x

echo "Hello world"