Spark job fails after running for a long time on YARN

Date: 2018-05-25 10:55:20

Tags: apache-spark

Hi, I have a Spark job that performs hyperparameter optimization over Word2Vec and logistic regression. My ML pipeline looks like this:

windowSize = [5,10]
minCount = [5,10]
maxIter= [10,100,1000]
regParam= [0.1,0.01]

######################################################################################

pipeline=Pipeline(stages=[transformer_filtered_question1,transformer_filtered_question2,token_q1,token_q2,remover_q1,remover_q2,
                          transformer_textlength_q1,transformer_textlength_q2,transformer_totalwords,
                          transformer_commonwords,transformer_difftwolength,
                          transformer_fuzz_qratio,transformer_fuzz_partial_token_setratio,
                          transformer_fuzz_partial_token_sortratio,transformer_fuzz_token_setratio,
                          transformer_fuzz_token_sortratio,transformer_fuzz_partialratio,transformer_fuzz_wratio,
                          q1w2model,q2w2model,
                          transformer_manhattan, transformer_braycurtis, transformer_canberra,
                          transformer_cosine,transformer_euclidean,
                          transformer_jaccard,transformer_minkowski,transformer_kurtosis_q1,
                          transformer_kurtosis_q2,transformer_skew_q1,transformer_skew_q2,
                          assembler,lr])

# ParamGridBuilder takes lists of values, not single integers

paramGrid = ParamGridBuilder() \
    .addGrid(q1w2model.windowSize,windowSize) \
    .addGrid(q1w2model.minCount,minCount) \
    .addGrid(q2w2model.windowSize,windowSize) \
    .addGrid(q2w2model.minCount,minCount) \
    .addGrid(lr.maxIter,maxIter) \
    .addGrid(lr.regParam, regParam) \
    .build()
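For context, the grid above is larger than it may look: the two Word2Vec stages each contribute their own windowSize and minCount axes, so the number of full pipeline fits multiplies out. A quick sanity check of the search-space size (plain Python, outside Spark), using the lists defined above:

```python
windowSize = [5, 10]
minCount = [5, 10]
maxIter = [10, 100, 1000]
regParam = [0.1, 0.01]

# q1 windowSize x q1 minCount x q2 windowSize x q2 minCount
# x lr maxIter x lr regParam -- one full pipeline fit per combination
n_models = (len(windowSize) * len(minCount)
            * len(windowSize) * len(minCount)
            * len(maxIter) * len(regParam))
print(n_models)  # 96
```

Ninety-six fits of a 30+-stage pipeline (each training two Word2Vec models and a logistic regression, some with maxIter=1000) goes a long way toward explaining the very long runtime.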


evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction', labelCol='label', metricName='areaUnderROC')


tvs = TrainValidationSplit(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          trainRatio=0.8)

I am running this on 1 master and 4 core nodes, all r4.2xlarge instances (8 vCPUs and 61 GB RAM each). Initially the job failed with an "out of memory" error, which I resolved by increasing the driver memory to 45G. After that I started getting "Connection reset by peer" errors, so I added the following two lines to my configuration:

 "spark.network.timeout":"10000000",
 "spark.executor.heartbeatInterval":"10000000"

Here is the final configuration I am using:

"spark.executor.memory": "12335M",
"spark.executor.cores": "2",
"spark.executor.instances": "19",
"spark.yarn.executor.memoryOverhead": "1396",
"spark.default.parallelism": "38",
"spark.driver.memory": "45G",
"spark.network.timeout": "10000000",
"spark.executor.heartbeatInterval": "10000000"
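One thing worth checking with a configuration like this is whether the requested executors actually fit on the core nodes. A rough back-of-the-envelope calculation (plain Python; the 8 vCPU / 61 GB figures are the r4.2xlarge specs from the question, and YARN normally reserves some of each node's memory for the OS and NodeManager):

```python
# Per-executor YARN container size = heap + memoryOverhead (in MB)
executor_memory_mb = 12335
memory_overhead_mb = 1396
container_mb = executor_memory_mb + memory_overhead_mb

# With 2 cores per executor on an 8-vCPU node, at most 4 executors
# fit on each node by CPU alone.
executors_per_node_by_cpu = 8 // 2

# Memory demanded per node if 4 such containers are packed onto it
per_node_demand_mb = executors_per_node_by_cpu * container_mb

print(container_mb, per_node_demand_mb)  # 13731 54924
```

So each container needs about 13.4 GB, and four of them per node need about 53.6 GB, which is close to the 61 GB limit once YARN's own reservation is subtracted. Note also that 4 core nodes x 4 executors = 16 executors, fewer than the 19 instances requested, so YARN cannot grant them all; executors getting killed or failing to launch under this pressure is consistent with the ClosedChannelException / failed RPC messages in the trace below.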

Here is the error trace I am getting now:

ERROR TransportClient: Failed to send RPC 9047984567231060528 to /172.31.4.221:36662: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
18/05/25 10:07:33 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(1,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 9047984567231060528 to /172.31.4.221:36662: java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:733)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:725)
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:35)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1062)
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1116)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1051)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
18/05/25 10:07:33 WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:572)
    at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:380)
    at org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:331)
    at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:281)
    at org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:225)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

0 Answers