从大型Spark Dataframe到H2O Dataframe的H2O苏打水错误

时间:2017-06-13 17:24:34

标签: apache-spark h2o sparklyr sparkling-water

当我尝试从spark数据帧转换为H2O数据框时,我得到以下错误。这似乎与数据帧的大小有关,因为当我使它变小时,spark和H2O之间的转换器运行良好。

是否需要更改任何配置才能使用苏打水将大型火花数据帧转换为H2O?在我的配置中,我允许驱动程序和执行程序获得最大内存,因此这不是内存问题。

我在这里使用R代码是:

training<-as_h2o_frame(sc, final1, strict_version_check = FALSE)

错误:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 95.1 failed 4 times, most recent failure: Lost task 4.3 in stage 95.1 (TID 4050, 10.0.0.9): java.lang.ArrayIndexOutOfBoundsException: 65535
                at water.DKV.get(DKV.java:202)
                at water.DKV.get(DKV.java:175)
                at water.Key.get(Key.java:83)
                at water.fvec.Frame.createNewChunks(Frame.java:896)
                at water.fvec.FrameUtils$class.createNewChunks(FrameUtils.scala:43)
                at water.fvec.FrameUtils$.createNewChunks(FrameUtils.scala:70)
                at org.apache.spark.h2o.backends.internal.InternalWriteConverterCtx.createChunks(InternalWriteConverterCtx.scala:29)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$.org$apache$spark$h2o$converters$SparkDataFrameConverter$$perSQLPartition(SparkDataFrameConverter.scala:95)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:74)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:74)
                at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
                at org.apache.spark.scheduler.Task.run(Task.scala:86)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
                at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
                at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
                at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
                at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
                at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
                at scala.Option.foreach(Option.scala:257)
                at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
                at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
                at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
                at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
                at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
                at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
                at org.apache.spark.SparkContext.runJob(SparkContext.scala:1906)
                at org.apache.spark.h2o.converters.WriteConverterCtxUtils$.convert(WriteConverterCtxUtils.scala:83)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$.toH2OFrame(SparkDataFrameConverter.scala:74)
                at org.apache.spark.h2o.H2OContext.asH2OFrame(H2OContext.scala:145)
                at org.apache.spark.h2o.H2OContext.asH2OFrame(H2OContext.scala:143)
                at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:498)
                at sparklyr.Invoke$.invoke(invoke.scala:102)
                at sparklyr.StreamHandler$.handleMethodCall(stream.scala:89)
                at sparklyr.StreamHandler$.read(stream.scala:54)
                at sparklyr.BackendHandler.channelRead0(handler.scala:49)
                at sparklyr.BackendHandler.channelRead0(handler.scala:14)
                at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
                at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
                at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
                at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
                at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
                at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
                at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
                at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
                at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
                at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
                at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
                at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
                at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
                at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
                at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
                at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
                at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
                at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 65535
                at water.DKV.get(DKV.java:202)
                at water.DKV.get(DKV.java:175)
                at water.Key.get(Key.java:83)
                at water.fvec.Frame.createNewChunks(Frame.java:896)
                at water.fvec.FrameUtils$class.createNewChunks(FrameUtils.scala:43)
                at water.fvec.FrameUtils$.createNewChunks(FrameUtils.scala:70)
                at org.apache.spark.h2o.backends.internal.InternalWriteConverterCtx.createChunks(InternalWriteConverterCtx.scala:29)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$.org$apache$spark$h2o$converters$SparkDataFrameConverter$$perSQLPartition(SparkDataFrameConverter.scala:95)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:74)
                at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:74)
                at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
                at org.apache.spark.scheduler.Task.run(Task.scala:86)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                ... 1 more

1 个答案:

答案 0 :(得分:0)

打算重新发布Jakub的评论,以便更轻松地找到它:

似乎您的H2O云未正确初始化。请在此处查看自述文件github.com/h2oai/rsparkling#spark-connection