SparkSession close() blocks indefinitely

Date: 2017-09-09 02:45:21

Tags: java apache-spark spark-dataframe

I have a Java Spark application (using spark-sql_2.11 2.1.1) that loads and caches data from HBase, does some processing, and then inserts the result into a Hive table; finally it closes the Spark session. Sometimes, however, the application gets blocked in the call to SparkSession.close(). After studying its thread dump I found a lock involving the SparkSession and Spark's ContextCleaner. Here is the relevant stack fragment:

    "pool-4-thread-4" #698 prio=5 os_prio=0 tid=0x00000000070c6000 nid=0x5075 waiting for monitor entry [0x00007fcc4a0cd000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.spark.ContextCleaner.stop(ContextCleaner.scala:142)
    - waiting to lock <0x00000000dc2f6d90> (a org.apache.spark.ContextCleaner)
    at org.apache.spark.SparkContext$$anonfun$stop$4$$anonfun$apply$mcV$sp$3.apply(SparkContext.scala:1817)
    at org.apache.spark.SparkContext$$anonfun$stop$4$$anonfun$apply$mcV$sp$3.apply(SparkContext.scala:1817)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.SparkContext$$anonfun$stop$4.apply$mcV$sp(SparkContext.scala:1817)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1816)
    at org.apache.spark.sql.SparkSession.stop(SparkSession.scala:665)
    at org.apache.spark.sql.SparkSession.close(SparkSession.scala:673)
    at a.b.c.d.e.Job$1.done(Job.java:21)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

"Spark Context Cleaner" #2585 daemon prio=5 os_prio=0 tid=0x0000000023eef800 nid=0x1b66 waiting on condition [0x00007fcc5597e000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000e36ce078> (a scala.concurrent.impl.Promise$CompletionLatch)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:190)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)
    at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:151)
    at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:303)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
    at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60)
    at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238)
    at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194)
    at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185)
    - locked <0x00000000dc2f6d90> (a org.apache.spark.ContextCleaner)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1245)
    at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)
    at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)

As you can see, the Spark ContextCleaner thread holds the monitor <0x00000000dc2f6d90> and is waiting on a result (state: TIMED_WAITING), but that wait never returns (even after several days), which leaves SparkSession.close() stuck in BLOCKED (waiting to lock <0x00000000dc2f6d90>).

Below is an excerpt of my code:

    SparkSession session = null;
    Dataset<Row> hBaseCache = null;
    try {
        session = initSparkSession();
        // cache hbase dataset
        hBaseCache = cacheHBaseData();
        // do something
        hBaseCache.map(...);
        hBaseCache.write().mode(SaveMode.Overwrite).insertInto("xxx_hive_table");
    } finally {
        if (hBaseCache != null) {
            hBaseCache.unpersist();
        }
        if (session != null) {
            session.close();
        }
    }

It seems the ContextCleaner started its unpersist() (broadcast cleanup) before sparkSession.close() was called, and it sat in TIMED_WAITING without ever making progress... My application uses the default RPC timeouts ("spark.rpc.askTimeout" / "spark.network.timeout" = "120s").
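
For reference, here is a minimal sketch (not from my actual application) of how those RPC timeouts could be set explicitly when building the session; the app name and the "600s" values are illustrative assumptions, only the config keys come from the text above:

    import org.apache.spark.sql.SparkSession;

    public class SessionFactory {
        // Hypothetical factory; initSparkSession() in the excerpt above is
        // assumed to do something similar.
        static SparkSession initSparkSession() {
            return SparkSession.builder()
                    .appName("hbase-to-hive-job")            // illustrative name
                    .config("spark.rpc.askTimeout", "600s")   // default mentioned above: 120s
                    .config("spark.network.timeout", "600s")  // default mentioned above: 120s
                    .enableHiveSupport()
                    .getOrCreate();
        }
    }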

Please share your thoughts on this issue.

Conclusion: case closed. Sorry for wasting your time, my bad — I was too hasty in checking only the code. It turns out someone from Ops had overridden the RPC timeout in the production environment to a very large value...
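
As a follow-up sketch (a hypothetical helper, not part of the original job), the effective values could be logged at startup to catch such an environment-level override early; SparkSession.conf().get(key, default) is the standard runtime-config accessor, and the "120s" fallbacks are just the documented defaults:

    import org.apache.spark.sql.SparkSession;

    class TimeoutCheck {
        // Logs the timeouts the session actually resolved, so an Ops-side
        // override such as the one described above shows up in the job output.
        static void logRpcTimeouts(SparkSession session) {
            String netTimeout = session.conf().get("spark.network.timeout", "120s");
            String askTimeout = session.conf().get("spark.rpc.askTimeout", netTimeout);
            System.out.println("spark.network.timeout=" + netTimeout
                    + ", spark.rpc.askTimeout=" + askTimeout);
        }
    }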

0 Answers:

No answers were posted.