Is there a way to have the Spark context read an executable file from HDFS?

Asked: 2015-02-24 04:02:44

Tags: apache-spark

I have put an executable file into HDFS. I want to read it on each Spark worker so that I can run the output of an RDD through it. Is there a way to do this? It would be something like sc.addFile("program"). Unfortunately sc.addFile does not work, because the following line:

JavaRDD<String> output = data.pipe("Program");

throws the following exception:

TaskSetManager: Lost task 2.0 in stage 0.0 (TID 3, compute14.dev):
java.io.IOException: Cannot run program "Program": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
    at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
    at java.lang.ProcessImpl.start(ProcessImpl.java:130)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
    ... 9 more

1 Answer:

Answer 0 (score: 0)

The argument to rdd.pipe() is the path of an executable on the local filesystem, so I think one thing you could do is read the program from HDFS, write it somewhere on the local filesystem (on each executor!), and then use rdd.pipe().

In particular, I think you could use binaryFiles() to load the program from HDFS, collect() it, store it in a broadcast variable, and then on the executors do something like:
rdd.foreachPartition(_ => /* write contents of broadcast variable
                             to some local file */ )
rdd.pipe( /* path of that local file */ )
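A minimal sketch of that approach, assuming Spark 1.2+ for binaryFiles(); the paths hdfs:///apps/Program, hdfs:///data/input, hdfs:///data/output, and /tmp/Program are placeholders, and it relies on the pipe stage's tasks landing on the same executors that ran the foreachPartition job:

    import java.nio.file.{Files, Paths}
    import org.apache.spark.{SparkConf, SparkContext}

    object PipeHdfsProgram {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipe-hdfs-program"))

        // Read the executable's bytes from HDFS; binaryFiles() yields
        // (path, PortableDataStream) pairs, and we assume a single file here.
        val programBytes = sc.binaryFiles("hdfs:///apps/Program").first()._2.toArray()

        // Ship the bytes to every executor as a broadcast variable.
        val program = sc.broadcast(programBytes)

        val data = sc.textFile("hdfs:///data/input")
        val localPath = "/tmp/Program" // placeholder local path on each executor

        // Eager action: materialize the program on the local filesystem of
        // every executor that holds a partition, before the pipe job runs.
        data.foreachPartition { _ =>
          val path = Paths.get(localPath)
          if (!Files.exists(path)) { // naive check; ignores concurrent-write races
            Files.write(path, program.value)
            path.toFile.setExecutable(true) // chmod +x so pipe() can fork it
          }
        }

        // Now the executable exists locally, so pipe() can run it.
        val output = data.pipe(localPath)
        output.saveAsTextFile("hdfs:///data/output")

        sc.stop()
      }
    }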