I have put an executable into HDFS. I want to read it on each Spark worker so that I can pipe the output of my RDD through it. Is there a way to do that, something along the lines of sc.addFile("program")? Unfortunately sc.addFile does not work, because the following line:
JavaRDD<String> output = data.pipe("Program");
throws the following exception:
TaskSetManager: Lost task 2.0 in stage 0.0 (TID 3, compute14.dev):
java.io.IOException: Cannot run program "Program": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 9 more
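For context, a minimal reconstruction of the setup being described. The HDFS path is a hypothetical placeholder; sc and data are assumed to be the JavaSparkContext and the input RDD from the snippet above:

// Ship the executable from HDFS to every worker...
sc.addFile("hdfs:///path/to/Program");          // hypothetical path to the executable
// ...and try to pipe the RDD through it by name.
JavaRDD<String> output = data.pipe("Program");  // fails on the executors with the
                                                // IOException shown above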
Answer 0 (score: 0)
The argument to rdd.pipe() is the path of an executable on the local filesystem, so I think one thing you could do is read the program from HDFS, write it somewhere on the local filesystem (on each executor!), and then use rdd.pipe().
In particular, I think you could load the data from HDFS with binaryFiles(), collect() it on the driver, store it in a broadcast variable, and then do something like:
rdd.foreachPartition(_ => /* write contents of broadcast variable to some local file */)
rdd.pipe(/* path of that local file */)
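A more concrete sketch of that idea in Java, matching the question's JavaRDD code. This is only a sketch under assumptions: sc is the JavaSparkContext, data is the RDD to be piped, and both the HDFS path and the local scratch path are hypothetical placeholders.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.input.PortableDataStream;

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;

// 1. Read the executable's bytes from HDFS on the driver.
JavaPairRDD<String, PortableDataStream> bin =
    sc.binaryFiles("hdfs:///path/to/Program");          // hypothetical HDFS path
byte[] programBytes = bin.first()._2().toArray();

// 2. Broadcast the bytes so each executor receives a copy.
Broadcast<byte[]> programBc = sc.broadcast(programBytes);

// 3. On every partition, materialize the bytes as a local executable file.
String localPath = "/tmp/Program";                      // assumed local scratch location
data.foreachPartition(iter -> {
    File f = new File(localPath);
    if (!f.exists()) {                                  // naive check; ignores concurrent writers
        Files.write(Paths.get(localPath), programBc.value());
        f.setExecutable(true);
    }
});

// 4. Pipe the RDD through the now-local executable.
JavaRDD<String> output = data.pipe(localPath);

Since this writes a file outside Spark's managed directories, the path must be writable on every executor, and the same executors have to run the later pipe() job (or the write has to be repeated), so treat it as a starting point rather than a robust solution.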