H2O苏打水经常抛到异常以下,我们会在发生这种情况时手动重新运行。问题是火花作业在发生此异常时不会退出,它们不会返回退出状态,我们无法自动执行此过程。
App > Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 316 in stage 22.0 failed 4 times, most recent failure: Lost task 316.3 in stage 22.0 (TID 9470, ip-**-***-***-**.ec2.internal): java.lang.ArrayIndexOutOfBoundsException: 65535
App > at water.DKV.get(DKV.java:202)
App > at water.DKV.get(DKV.java:175)
App > at water.Key.get(Key.java:83)
App > at water.fvec.Frame.createNewChunks(Frame.java:896)
App > at water.fvec.FrameUtils$class.createNewChunks(FrameUtils.scala:43)
App > at water.fvec.FrameUtils$.createNewChunks(FrameUtils.scala:70)
App > at org.apache.spark.h2o.backends.internal.InternalWriteConverterContext.createChunks(InternalWriteConverterContext.scala:28)
App > at org.apache.spark.h2o.converters.SparkDataFrameConverter$class.org$apache$spark$h2o$converters$SparkDataFrameConverter$$perSQLPartition(SparkDataFrameConverter.scala:86)
App > at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:67)
App > at org.apache.spark.h2o.converters.SparkDataFrameConverter$$anonfun$toH2OFrame$1$$anonfun$apply$2.apply(SparkDataFrameConverter.scala:67)
App > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
App > at org.apache.spark.scheduler.Task.run(Task.scala:85)
App > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
App > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
App > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
答案 0 :(得分:0)
本期正在调查以下有关Sparkling Water项目的问题:
似乎某种程度上与数据的大小有关。
当我们尝试将巨大的火花数据帧拉到h2o帧时会发生这种情况。 63m记录x 6300列。 虽然H2O / Sparkling Water集群的大小合适:(每个执行器有40个x 17g,每个Spark执行器有4个线程/核心) 因此总内存量为680Gb
我们从未在较小的数据集上出现此错误。