"由于阶段失败而导致工作中止"在SparkR

Date: 2017-05-23 04:28:11

Tags: r apache-spark sparkr

Following the instructions in the SparkR documentation (https://spark.apache.org/docs/latest/sparkr.html#from-local-data-frames), I created a SparkDataFrame with the following code:

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "spark://master:7077", sparkConfig = list(spark.cores.max="8", spark.executor.cores = "4"))
data(iris)
iris = createDataFrame(iris)
head(iris)

But the head function always produces the error below. I get the same error when I run dim. I have also tried as.DataFrame instead of createDataFrame, restarted the kernel in my IPython notebook, and restarted my Spark session.

My understanding is that this is very basic SparkR functionality, so I really don't know why it isn't working. For some reason I have no problem when I create a SparkDataFrame by reading from a data source with read.jdbc. I also noticed that the number in the error line "Task 0 in stage XXX ..." increases by 1 on every failure.

I also noticed that the error seems to come from the executors not being able to find the Rscript binary, although I am not sure why this only happens for SparkDataFrames created from local data.frames and not for ones pulled from an external data source.

Can someone help me?

The full error stack trace is:


Warning message in FUN(X[[i]], ...): "Use Sepal_Length instead of Sepal.Length as column name"
Warning message in FUN(X[[i]], ...): "Use Sepal_Width instead of Sepal.Width as column name"
Warning message in FUN(X[[i]], ...): "Use Petal_Length instead of Petal.Length as column name"
Warning message in FUN(X[[i]], ...): "Use Petal_Width instead of Petal.Width as column name"


Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 45.0 failed 4 times, most recent failure: Lost task 0.3 in stage 45.0 (TID 3372, 10.0.0.5): java.io.IOException: Cannot run program "Rscript": error=2, No such file or directory
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
  at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:348)
  at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:364)
  at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
  at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=2, No such file or directory
  at java.lang.UNIXProcess.forkAndExec(Native Method)
  at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
  at java.lang.ProcessImpl.start(ProcessImpl.java:134)
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
  ... 24 more


Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
  at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
  at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
  at org.apache.spark.sql.api.r.SQLUtils$.dfToCols(SQLUtils.scala:208)
  at org.apache.spark.sql.api.r.SQLUtils.dfToCols(SQLUtils.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
  at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
  at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
  at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
  at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
  at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
  at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Cannot run program "Rscript": error=2, No such file or directory
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
  at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:348)
  at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:364)
  at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
  at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
  at org.apache.spark

Traceback:
 1. head(charEx)
 2. head(charEx)
 3. .local(x, ...)
 4. take(x, num)
 5. take(x, num)
 6. collect(limited)
 7. collect(limited)
 8. .local(x, ...)
 9. callJStatic("org.apache.spark.sql.api.r.SQLUtils", "dfToCols", x@sdf)
10. invokeJava(isStatic = TRUE, className, methodName, ...)
11. stop(readString(conn))

1 answer:

Answer 0 (score: 0):

This is how I understand it:

The reason read.jdbc works is that, on the worker nodes, R is not needed to perform the operation: the driver (which runs R) translates the commands into Spark (JVM) instructions before they are shipped to the worker nodes and executed.
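For comparison, a JDBC read is executed entirely by the JVM executors, so no Rscript process has to be started on the workers. A minimal sketch; the JDBC URL, table name, and credentials below are placeholders, not values from the question:

library(SparkR)
sparkR.session(master = "spark://master:7077")

# The read runs on the JVM side; R is only needed on the driver.
df <- read.jdbc("jdbc:postgresql://dbhost:5432/mydb", "mytable",
                user = "dbuser", password = "dbpass")
head(df)  # works even when R is not installed on the worker nodes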

The reason createDataFrame fails is that the R code is copied to the worker nodes by the createDataFrame command, so the nodes need access to Rscript.
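If you do need createDataFrame to work on the cluster, the usual fix is to install R on every worker node so that the executors can find Rscript, or to point them at the binary explicitly. A sketch, assuming Rscript lives at /usr/local/bin/Rscript on the workers (that path is an assumption; check where Rscript is actually installed on each node):

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

# spark.r.command tells the executors which executable to use when they
# launch R worker processes; the path below is only an example.
sparkR.session(
  master = "spark://master:7077",
  sparkConfig = list(
    spark.cores.max = "8",
    spark.executor.cores = "4",
    spark.r.command = "/usr/local/bin/Rscript"
  )
)

iris_df <- createDataFrame(iris)
head(iris_df)  # should now succeed, since the executors can start Rscript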

If you only want to use Spark to play around with your data a bit, I suggest using a local Spark session (otherwise you will have to copy Rscript to your worker nodes). If you need to push the data into Spark from R via createDataFrame first, you may want to rethink your workflow: you are probably using Spark because you have a lot of data, and it is usually better to load and keep everything on the Spark side and then pull aggregated chunks into R's memory.
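A sketch of both alternatives; the aggregation is only illustrative, and the column names follow the renaming warnings shown above (Sepal.Length becomes Sepal_Length):

library(SparkR)

# Alternative 1: a local session. Driver and executors share one machine,
# so the local Rscript is always found and createDataFrame just works.
sparkR.session(master = "local[*]")
iris_local <- createDataFrame(iris)
head(iris_local)

# Alternative 2: keep the data and the heavy work on the Spark side and
# only collect small, aggregated results back into local R memory.
by_species <- agg(groupBy(iris_local, "Species"),
                  avg_sepal_length = mean(iris_local$Sepal_Length))
collect(by_species)  # a small local data.frame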