I am new to Spark programming. I am trying to use the pipe operator to embed external programs (a folder of files containing compiled C programs, bash scripts, and Python scripts). The code looks like this:
```scala
sc.addFile("hdfs://afolder", true)
val infile = sc.textFile("afile.txt").pipe("afolder/abash.sh").take(3)
```
abash.sh in turn calls the other scripts and programs to process afile.txt.

This code runs fine in Spark local mode, but when I deploy it in YARN mode (client or cluster) it fails with the following message:
> WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4, database): java.io.IOException: Cannot run program "afolder/abash.sh": error=13, Permission denied
All subdirectories and files of the folder are downloaded successfully into the local Spark tmp directory (in my case /usr/local/hadoop/spark/). After the first failure I recursively set 777 permissions on the folder in HDFS, but I still get the same error.

Any idea how to fix this? Thanks.

Error output:
> 16/05/18 16:04:09 INFO storage.MemoryStore: Block broadcast_2 stored
> as values in memory (estimated size 212.1 KB, free 212.1 KB) 16/05/18
> 16:04:09 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as
> bytes in memory (estimated size 19.5 KB, free 231.6 KB) 16/05/18
> 16:04:09 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in
> memory on 210.107.197.201:42777 (size: 19.5 KB, free: 511.1 MB)
> 16/05/18 16:04:09 INFO spark.SparkContext: Created broadcast 2 from
> textFile at <console>:27 16/05/18 16:04:09 INFO
> mapred.FileInputFormat: Total input paths to process : 1 16/05/18
> 16:04:09 INFO spark.SparkContext: Starting job: take at <console>:27
> 16/05/18 16:04:09 INFO scheduler.DAGScheduler: Got job 1 (take at
> <console>:27) with 1 output partitions 16/05/18 16:04:09 INFO
> scheduler.DAGScheduler: Final stage: ResultStage 1 (take at
> <console>:27) 16/05/18 16:04:09 INFO scheduler.DAGScheduler: Parents
> of final stage: List() 16/05/18 16:04:09 INFO scheduler.DAGScheduler:
> Missing parents: List() 16/05/18 16:04:09 INFO scheduler.DAGScheduler:
> Submitting ResultStage 1 (PipedRDD[5] at pipe at <console>:27), which
> has no missing parents 16/05/18 16:04:09 INFO storage.MemoryStore:
> Block broadcast_3 stored as values in memory (estimated size 3.7 KB,
> free 235.3 KB) 16/05/18 16:04:09 INFO storage.MemoryStore: Block
> broadcast_3_piece0 stored as bytes in memory (estimated size 2.2 KB,
> free 237.5 KB) 16/05/18 16:04:09 INFO storage.BlockManagerInfo: Added
> broadcast_3_piece0 in memory on 210.107.197.201:42777 (size: 2.2 KB,
> free: 511.1 MB) 16/05/18 16:04:09 INFO spark.SparkContext: Created
> broadcast 3 from broadcast at DAGScheduler.scala:1006 16/05/18
> 16:04:09 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from
> ResultStage 1 (PipedRDD[5] at pipe at <console>:27) 16/05/18 16:04:09
> INFO cluster.YarnScheduler: Adding task set 1.0 with 1 tasks 16/05/18
> 16:04:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0
> (TID 4, database, partition 0,NODE_LOCAL, 2603 bytes) 16/05/18
> 16:04:11 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in
> memory on database:51757 (size: 2.2 KB, free: 511.1 MB) 16/05/18
> 16:04:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0
> (TID 4, database): java.io.IOException: Cannot run program
> "afolder/abash.sh": error=13, Permission denied
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: error=13, Permission denied
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 9 more
>
> 16/05/18 16:04:11 INFO scheduler.TaskSetManager: Starting task 0.1 in
> stage 1.0 (TID 5, database, partition 0,NODE_LOCAL, 2603 bytes)
> 16/05/18 16:04:12 INFO storage.BlockManagerInfo: Added
> broadcast_3_piece0 in memory on database:52395 (size: 2.2 KB, free:
> 511.1 MB) 16/05/18 16:04:12 INFO scheduler.TaskSetManager: Lost task 0.1 in stage 1.0 (TID 5) on executor database: java.io.IOException (Cannot run program "afolder/abash.sh": error=13, Permission denied)
> [duplicate 1] 16/05/18 16:04:12 INFO scheduler.TaskSetManager:
> Starting task 0.2 in stage 1.0 (TID 6, database, partition
> 0,NODE_LOCAL, 2603 bytes) 16/05/18 16:04:12 INFO
> scheduler.TaskSetManager: Lost task 0.2 in stage 1.0 (TID 6) on
> executor database: java.io.IOException (Cannot run program
> "afolder/abash.sh": error=13, Permission denied) [duplicate 2]
> 16/05/18 16:04:12 INFO scheduler.TaskSetManager: Starting task 0.3 in
> stage 1.0 (TID 7, database, partition 0,NODE_LOCAL, 2603 bytes)
> 16/05/18 16:04:12 INFO scheduler.TaskSetManager: Lost task 0.3 in
> stage 1.0 (TID 7) on executor database: java.io.IOException (Cannot
> run program "afolder/abash.sh": error=13, Permission denied)
> [duplicate 3] 16/05/18 16:04:12 ERROR scheduler.TaskSetManager: Task 0
> in stage 1.0 failed 4 times; aborting job 16/05/18 16:04:12 INFO
> cluster.YarnScheduler: Removed TaskSet 1.0, whose tasks have all
> completed, from pool 16/05/18 16:04:12 INFO cluster.YarnScheduler:
> Cancelling stage 1 16/05/18 16:04:12 INFO scheduler.DAGScheduler:
> ResultStage 1 (take at <console>:27) failed in 2.955 s 16/05/18
> 16:04:12 INFO scheduler.DAGScheduler: Job 1 failed: take at
> <console>:27, took 2.963885 s org.apache.spark.SparkException: Job
> aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most
> recent failure: Lost task 0.3 in stage 1.0 (TID 7, database):
> java.io.IOException: Cannot run program "afolder/abash.sh": error=13,
> Permission denied
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: error=13, Permission denied
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 9 more
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1328)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.take(RDD.scala:1302)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
> at $iwC$$iwC$$iwC.<init>(<console>:40)
> at $iwC$$iwC.<init>(<console>:42)
> at $iwC.<init>(<console>:44)
> at <init>(<console>:46)
> at .<init>(<console>:50)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.IOException: Cannot run program "afolder/abash.sh":
> error=13, Permission denied
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: error=13, Permission denied
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 9 more
Answer 0 (score: 2)
Try `chmod +x afolder/abash.sh`.
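In a YARN deployment the execute bit has to be present on the copy each executor actually runs, so it is usually easiest to set it on the HDFS source before calling `sc.addFile`. A minimal sketch, assuming the scripts live under the `hdfs://afolder` path from the question:

```bash
# Recursively mark the folder's contents executable at the source
# (755 = rwxr-xr-x), then verify the modes took effect.
hdfs dfs -chmod -R 755 hdfs://afolder
hdfs dfs -ls -R hdfs://afolder
```

Whether the fetched copies keep that bit can depend on your Spark/Hadoop versions; if they do not, invoking the script through an interpreter (see the next answer) avoids needing the execute bit at all.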
Answer 1 (score: 1)
Change the code like this: `pipe("./afolder/abash.sh")`, and make sure abash.sh is executable.
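A short sketch of both variants, assuming the folder is shipped with `sc.addFile(..., true)` as in the question and is materialized in each YARN container's working directory (which the original error suggests, since the relative path was found but not executable):

```scala
// Folder with the compiled C programs and scripts, shipped to every executor.
sc.addFile("hdfs://afolder", true)

// Option A: explicit relative path into the container's working directory;
// requires the execute bit on abash.sh (see the chmod +x answer above).
val out1 = sc.textFile("afile.txt").pipe("./afolder/abash.sh").take(3)

// Option B: launch through bash so the fetched copy needs no execute bit.
// RDD.pipe also accepts a Seq of command tokens.
val out2 = sc.textFile("afile.txt")
  .pipe(Seq("bash", "afolder/abash.sh"))
  .take(3)
```

Option B trades the dependency on file permissions for a dependency on `bash` being on the executors' PATH, which is usually the safer assumption on a Hadoop cluster.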