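I am trying to use H2O Sparkling Water on Google DataProc. I have already run Sparkling Water successfully on a standalone Spark installation and am now moving on to DataProc. Initially I got an error saying that spark.dynamicAllocation.enabled
is not supported, so I went onto the master node and started pyspark like this: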
pyspark \
--conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
--conf spark.dynamicAllocation.enabled=false
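The session that starts Sparkling Water looks like this. Once the stage reaches around task 30000 it starts to grind, and after about 30 minutes a series of errors appears: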
>>> from pysparkling import *
>>> import h2o
>>> hc = H2OContext.getOrCreate(spark)
18/04/11 11:56:08 WARN org.apache.spark.h2o.backends.internal.InternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
18/04/11 11:56:08 WARN org.apache.spark.h2o.backends.internal.InternalH2OBackend: Due to non-deterministic behavior of Spark broadcast-based joins
We recommend to disable them by
configuring `spark.sql.autoBroadcastJoinThreshold` variable to value `-1`:
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
[Stage 0:=================> (35346 + 11) / 100001]
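I have tried various things:
- Deploying a small cluster (3 nodes).
- Deploying a cluster with 30 workers.
- Running DataProc images 1.1 (Spark 2.0), 1.2 (Spark 2.2) and preview (Spark 2.2).
I also tried various Spark options: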
spark.ext.h2o.fail.on.unsupported.spark.param=false
spark.ext.h2o.nthreads=2
spark.ext.h2o.cluster.size=2
spark.ext.h2o.default.cluster.size=2
spark.ext.h2o.hadoop.memory=50m
spark.ext.h2o.repl.enabled=false
spark.ext.h2o.flatfile=false
spark.dynamicAllocation.enabled=false
spark.executor.memory=700m
Has anyone had any luck with H2O on Google DataProc?
The detailed errors include:
18/04/11 12:08:40 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1523445048432_0005_01_000006 on host: cluster-dev-w-0.c.trust-networks.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1523445048432_0005_01_000006
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:972)
at org.apache.hadoop.util.Shell.run(Shell.java:869)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1170)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:236)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:305)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:84)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 1
18/04/11 12:08:48 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result RpcResponse{requestId=5571077381947066483, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /10.154.0.12:59387; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
Later:
Exception in thread "task-result-getter-3" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.Class.newReflectionData(Class.java:2513)
at java.lang.Class.reflectionData(Class.java:2503)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2660)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.newInstance(Class.java:412)
at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:403)
at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:394)
at java.security.AccessController.doPrivileged(Native Method)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:393)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
Answer 0 (score: 2)
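OK, I think I solved this myself. Sparkling Water allocates resources based on a number of settings whose Google DataProc defaults do not suit it.
I edited /etc/spark/conf/spark-defaults.conf, changed spark.dynamicAllocation.enabled to false and changed spark.ext.h2o.dummy.rdd.mul.factor to 1. With those changes the H2O cluster starts in about 3 minutes and uses roughly a tenth of the resources.
If it is still too slow to start, try reducing spark.executor.instances from 10000 to 5000 or 1000, although this setting affects the performance of everything else you're running on the Spark cluster.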
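For anyone who wants to script the edit described above rather than change the file by hand, here is a minimal sketch. It only rewrites text in the spark-defaults.conf key/value format; the function name set_spark_defaults and the sample file contents are made up for illustration, and it assumes one property per line with no line continuations.

```python
import re


def set_spark_defaults(conf_text, overrides):
    """Return conf_text with every key in overrides set to the given value.

    Hypothetical helper: existing properties are rewritten in place,
    missing ones are appended, comments and blank lines are kept.
    """
    remaining = dict(overrides)
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            # A property line is "key value", "key=value" or "key:value".
            key = re.split(r"[\s=:]+", stripped, maxsplit=1)[0]
            if key in remaining:
                out.append(f"{key} {remaining.pop(key)}")
                continue
        out.append(line)
    # Append any property that was not already in the file.
    for key, value in remaining.items():
        out.append(f"{key} {value}")
    return "\n".join(out) + "\n"


# Sample contents are an assumption, not the real DataProc file.
sample = (
    "# cluster defaults\n"
    "spark.dynamicAllocation.enabled true\n"
    "spark.executor.instances 10000\n"
)
updated = set_spark_defaults(
    sample,
    {
        "spark.dynamicAllocation.enabled": "false",
        "spark.ext.h2o.dummy.rdd.mul.factor": "1",
    },
)
print(updated)
```

To apply it for real you would read /etc/spark/conf/spark-defaults.conf on the master, pass the text through the function and write the result back (keeping a backup first). New pyspark sessions should then pick up the changed values.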
Answer 1 (score: 1)
You got a java.lang.OutOfMemoryError. Give it more memory.
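One possible reading of "more memory": the OutOfMemoryError in the question is thrown in a "task-result-getter" thread, which runs in the Spark driver, so the driver heap is a reasonable first thing to raise. The sketch below just assembles a pyspark command line with --driver-memory plus the --conf options from the question. The helper name pyspark_command and the 4g value are assumptions for illustration, not a tested recommendation.

```python
import shlex


def pyspark_command(driver_memory, conf):
    """Build the argument list for launching pyspark (hypothetical helper)."""
    args = ["pyspark", "--driver-memory", driver_memory]
    for key, value in conf.items():
        # Each Spark property is passed as its own --conf key=value pair.
        args += ["--conf", f"{key}={value}"]
    return args


cmd = pyspark_command(
    "4g",  # assumed value; size it to the master node's available RAM
    {
        "spark.ext.h2o.fail.on.unsupported.spark.param": "false",
        "spark.dynamicAllocation.enabled": "false",
    },
)
# Print a shell-safe version that can be pasted into a terminal.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the list first and quoting it at the end keeps each property as a separate argument, which also makes it easy to hand the same list to subprocess.run if you want to launch it from a script.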