When calling SparkSession.close(), after doing some research on the thread dump, I found a lock involving SparkSession and Spark's ContextCleaner. Here is the stack snippet:
"pool-4-thread-4" #698 prio=5 os_prio=0 tid=0x00000000070c6000 nid=0x5075 waiting for monitor entry [0x00007fcc4a0cd000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.spark.ContextCleaner.stop(ContextCleaner.scala:142)
- waiting to lock <0x00000000dc2f6d90> (a org.apache.spark.ContextCleaner)
at org.apache.spark.SparkContext$$anonfun$stop$4$$anonfun$apply$mcV$sp$3.apply(SparkContext.scala:1817)
at org.apache.spark.SparkContext$$anonfun$stop$4$$anonfun$apply$mcV$sp$3.apply(SparkContext.scala:1817)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext$$anonfun$stop$4.apply$mcV$sp(SparkContext.scala:1817)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1816)
at org.apache.spark.sql.SparkSession.stop(SparkSession.scala:665)
at org.apache.spark.sql.SparkSession.close(SparkSession.scala:673)
at a.b.c.d.e.Job$1.done(Job.java:21)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"Spark Context Cleaner" #2585 daemon prio=5 os_prio=0 tid=0x0000000023eef800 nid=0x1b66 waiting on condition [0x00007fcc5597e000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000e36ce078> (a scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:151)
at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:303)
at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60)
at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185)
- locked <0x00000000dc2f6d90> (a org.apache.spark.ContextCleaner)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1245)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)
As you can see, the Spark Context Cleaner thread has locked <0x00000000dc2f6d90> and is waiting for an RPC result (state: TIMED_WAITING), but the result never arrives (even after several days), which leaves SparkSession.close() BLOCKED (waiting to lock <0x00000000dc2f6d90>).
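To make the contention easier to see, here is a minimal, self-contained Java sketch of the same locking pattern (class and method names are hypothetical, not Spark's actual implementation): one thread holds the cleaner's monitor while blocking on a remote call, so any thread that tries to enter stop() on the same monitor stays BLOCKED.

// Simplified illustration of the contention visible in the dump above.
// Names are hypothetical; Spark's real ContextCleaner is more involved.
public class CleanerDeadlockDemo {

    static final Object cleanerLock = new Object();

    public static void main(String[] args) throws Exception {
        // "Spark Context Cleaner": holds the monitor while awaiting a slow RPC.
        Thread cleaner = new Thread(() -> {
            synchronized (cleanerLock) {           // corresponds to "locked <0x...6d90>"
                awaitRpcResult();                  // TIMED_WAITING, effectively forever
            }
        }, "context-cleaner");
        cleaner.setDaemon(true);
        cleaner.start();

        Thread.sleep(100);                         // let the cleaner grab the lock first

        // The close()/stop() caller: blocks on the same monitor -> BLOCKED.
        synchronized (cleanerLock) {               // "waiting to lock <0x...6d90>"
            System.out.println("stop() finished"); // never reached while the RPC hangs
        }
    }

    private static void awaitRpcResult() {
        try {
            Thread.sleep(Long.MAX_VALUE);          // stand-in for Await.result with a huge timeout
        } catch (InterruptedException ignored) {
        }
    }
}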
Here is an excerpt of my code:

SparkSession session = null;
Dataset<Row> hBaseCache = null;
try {
    session = initSparkSession();
    // cache the HBase dataset
    hBaseCache = cacheHBaseData();
    // do something with it
    hBaseCache.map(...);
    hBaseCache.write().mode(SaveMode.Overwrite).insertInto("xxx_hive_table");
} finally {
    if (hBaseCache != null) {
        hBaseCache.unpersist();
    }
    if (session != null) {
        session.close();
    }
}
unpersist() is invoked before sparkSession.close(), but it seems the ContextCleaner thread was stuck in TIMED_WAITING and never got its work done...
My application uses the default RPC timeouts ("spark.rpc.askTimeout" / "spark.network.timeout": "120s").
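For reference, a minimal sketch of how these timeouts could be pinned explicitly when building the session, so a cluster-level override cannot silently change them (the builder API is standard Spark; the assumption that initSparkSession() looks roughly like this is mine):

import org.apache.spark.sql.SparkSession;

// Sketch: pin the RPC/network timeouts in code instead of relying on
// whatever the cluster-level configuration provides.
public class SparkSessionFactory {
    static SparkSession initSparkSession() {
        return SparkSession.builder()
                .appName("hbase-to-hive-job")            // hypothetical app name
                .config("spark.rpc.askTimeout", "120s")   // default quoted above
                .config("spark.network.timeout", "120s")  // default quoted above
                .getOrCreate();
    }
}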
Please share your thoughts on this issue.
Conclusion: case closed. Sorry for taking your time, my bad; I rushed through checking the code. It turned out someone from Ops had overridden the RPC timeout in the production environment to a very large value...