Hi, I have a Spark job that does hyperparameter optimization for Word2Vec and logistic regression. My ML pipeline looks like this:
windowSize = [5, 10]
minCount = [5, 10]
maxIter = [10, 100, 1000]
regParam = [0.1, 0.01]
######################################################################################
pipeline = Pipeline(stages=[
    transformer_filtered_question1, transformer_filtered_question2,
    token_q1, token_q2, remover_q1, remover_q2,
    transformer_textlength_q1, transformer_textlength_q2, transformer_totalwords,
    transformer_commonwords, transformer_difftwolength,
    transformer_fuzz_qratio, transformer_fuzz_partial_token_setratio,
    transformer_fuzz_partial_token_sortratio, transformer_fuzz_token_setratio,
    transformer_fuzz_token_sortratio, transformer_fuzz_partialratio, transformer_fuzz_wratio,
    q1w2model, q2w2model,
    transformer_manhattan, transformer_braycurtis, transformer_canberra,
    transformer_cosine, transformer_euclidean,
    transformer_jaccard, transformer_minkowski, transformer_kurtosis_q1,
    transformer_kurtosis_q2, transformer_skew_q1, transformer_skew_q2,
    assembler, lr])
# ParamGridBuilder only takes lists of values, not single integers
paramGrid = ParamGridBuilder() \
    .addGrid(q1w2model.windowSize, windowSize) \
    .addGrid(q1w2model.minCount, minCount) \
    .addGrid(q2w2model.windowSize, windowSize) \
    .addGrid(q2w2model.minCount, minCount) \
    .addGrid(lr.maxIter, maxIter) \
    .addGrid(lr.regParam, regParam) \
    .build()
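Just to give a sense of the search space, here is a quick count of how many pipeline fits this grid produces (using the lists defined above); each fit retrains both Word2Vec stages from scratch:

from functools import reduce

# Each addGrid call multiplies the number of candidate models.
grid_sizes = [len(windowSize), len(minCount),   # q1w2model
              len(windowSize), len(minCount),   # q2w2model
              len(maxIter), len(regParam)]      # lr
total_fits = reduce(lambda a, b: a * b, grid_sizes)
print(total_fits)  # 2*2*2*2*3*2 = 96 full pipeline fits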
evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='label', metricName='areaUnderROC')
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=evaluator,
                           trainRatio=0.8)
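For reference, the fit itself is just the following (train_df here is a placeholder name for my prepared DataFrame with the raw question columns and a label column):

# train_df is a placeholder for the prepared input DataFrame.
model = tvs.fit(train_df)

best = model.bestModel           # winning PipelineModel
print(model.validationMetrics)   # areaUnderROC per parameter map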
I am running on 1 master and 4 core r4.2xlarge instances (8 vCPUs and 61 GB each). Initially the job failed with an out-of-memory error, which I fixed by increasing the driver memory to 45G. After that I started getting "Connection reset by peer" errors, so I added the following two lines to my configuration:
"spark.network.timeout":"10000000",
"spark.executor.heartbeatInterval":"10000000"
Here is the final configuration I am using:
"spark.executor.memory": "12335M",
"spark.executor.cores": "2",
"spark.executor.instances" : "19",
"spark.yarn.executor.memoryOverhead" : "1396",
"spark.default.parallelism" : "38",
"spark.driver.memory": "45G",
"spark.network.timeout":"10000000",
"spark.executor.heartbeatInterval":"10000000"
And here is the error traceback I am getting now:
ERROR TransportClient: Failed to send RPC 9047984567231060528 to /172.31.4.221:36662: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
18/05/25 10:07:33 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(1,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 9047984567231060528 to /172.31.4.221:36662: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:733)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:725)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:35)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1062)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1116)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1051)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
18/05/25 10:07:33 WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:572)
at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:380)
at org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:331)
at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:281)
at org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:225)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)