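Current setup: an Azure Data Factory pipeline is scheduled to run every 15 minutes and runs several Databricks notebooks on an always-on interactive Databricks cluster.
The problem: after 4-5 runs, the pipeline fails because of a Spark driver issue. There is no collect statement that could fill up the driver's memory. The error log shows the problem occurring when the driver tries to write information to the internal metastore (managed automatically by Databricks). That thread exceeds the GC overhead limit and triggers Full GCs. As a result, the driver is killed and the notebook run fails.
Here are the logs: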
19/11/06 04:56:47 ERROR DatabricksMain$DBUncaughtExceptionHandler: Uncaught exception in thread db-atomic-read-worker-5095!
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
at java.lang.StringBuilder.append(StringBuilder.java:190)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFSpan(ObjectInputStream.java:3506)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:3414)
at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:3226)
at java.io.ObjectInputStream.readString(ObjectInputStream.java:1905)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1564)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at java.util.Hashtable.readObject(Hashtable.java:1213)
at sun.reflect.GeneratedMethodAccessor36.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2178)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
at org.apache.commons.lang3.SerializationUtils.clone(SerializationUtils.java:94)
at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:370)
at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:366)
at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:391)
at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:298)
at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:255)
at java.lang.Thread.init(Thread.java:420)
at java.lang.Thread.init(Thread.java:349)
at java.lang.Thread.<init>(Thread.java:511)
at sun.security.ssl.SSLSocketImpl$NotifyHandshakeThread.<init>(SSLSocketImpl.java:2675)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1096)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
19/11/06 04:56:47 ERROR DatabricksMain$DBUncaughtExceptionHandler: OutOfMemoryError in thread db-atomic-read-worker-5095! Killing thread now.
19/11/06 04:56:47 WARN TrapExitSecurityManager: Called "System.exit(15)" in db-atomic-read-worker-5095!
Stack Trace:
java.lang.Thread.getStackTrace(Thread.java:1559)
com.databricks.backend.daemon.driver.TrapExitSecurityManager.checkExit(DriverLocal.scala:686)
java.lang.Runtime.halt(Runtime.java:273)
com.databricks.DatabricksMain$DBUncaughtExceptionHandler.uncaughtException(DatabricksMain.scala:363)
java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1057)
java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1052)
java.lang.Thread.dispatchUncaughtException(Thread.java:1959)
19/11/06 04:56:47 WARN TrapExitSecurityManager: Allowed to exit because this is OOM!
19/11/06 04:56:52 INFO StaticConf$: DB_HOME: /databricks
19/11/06 04:56:53 INFO DriverDaemon$: ========== driver starting up ==========
19/11/06 04:56:53 INFO DriverDaemon$: Java: Private Build 1.8.0_222
19/11/06 04:56:53 INFO DriverDaemon$: OS: Linux/amd64 4.15.0-1050-azure
19/11/06 04:56:53 INFO DriverDaemon$: CWD: /databricks/driver
Connectivity issue with the unmanaged metastore:
Current allocation: Map(1414820437514047686 -> 1, 289483405015881873 -> 175)
Ideal allocation: Map(1414820437514047686 -> 88, 289483405015881873 -> 88)
Starved pools: Map(1414820437514047686 -> 98.420017518)
19/11/06 04:55:37 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 588 to 10.139.64.20:49530
19/11/06 04:55:29 ERROR BoneCP: Failed to acquire connection to jdbc:mariadb://consolidated-westeurope-prod-metastore-addl-1.mysql.database.azure.com:3306/organization4787651615040525?trustServerCertificate=true&useSSL=true. Sleeping for 7000 ms. Attempts left: 5
java.sql.SQLNonTransientConnectionException: Could not connect to consolidated-westeurope-prod-metastore-addl-1.mysql.database.azure.com:3306 : Connection reset
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:161)
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:106)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:1036)
at org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:490)
at org.mariadb.jdbc.MariaDbConnection.newConnection(MariaDbConnection.java:144)
at org.mariadb.jdbc.Driver.connect(Driver.java:90)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
at com.jolbox.bonecp.BoneCP.obtainInternalConnection(BoneCP.java:269)
at com.jolbox.bonecp.ConnectionHandle.<init>(ConnectionHandle.java:242)
at com.jolbox.bonecp.PoolWatchThread.fillConnections(PoolWatchThread.java:115)
at com.jolbox.bonecp.PoolWatchThread.run(PoolWatchThread.java:82)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLNonTransientConnectionException: Could not connect to consolidated-westeurope-prod-metastore-addl-1.mysql.database.azure.com:3306 : Connection reset
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:161)
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.connException(ExceptionMapper.java:79)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.handleConnectionPhases(AbstractConnectProtocol.java:724)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:402)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.connectWithoutProxy(AbstractConnectProtocol.java:1032)
... 13 more
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
at sun.security.ssl.SSLSocketImpl.waitForClose(SSLSocketImpl.java:1761)
at sun.security.ssl.HandshakeOutStream.flush(HandshakeOutStream.java:124)
at sun.security.ssl.Handshaker.kickstart(Handshaker.java:1079)
at sun.security.ssl.SSLSocketImpl.kickstartHandshake(SSLSocketImpl.java:1479)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1346)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at org.mariadb.jdbc.internal.protocol.AbstractConnectProtocol.handleConnectionPhases(AbstractConnectProtocol.java:676)
... 15 more
19/11/06 04:55:37 WARN PreemptionMonitor: Preempted 43/43 tasks from 289483405015881873.
19/11/06 04:55:53 WARN PreemptionMonitor: Attempting to preempt 43 tasks from overallocated pools.
19/11/06 04:55:53 INFO PreemptionMonitor: Current allocation state:
Current max parallelism: 176
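To line up the out-of-memory events with the metastore connection failures and the driver restarts, something like the following could be used to scan the driver log. It is only a minimal sketch and not part of the current setup: the function name `scan_driver_log` and the event labels are hypothetical, the marker strings are copied from the log excerpts above, and the timestamp format is assumed to be yy/MM/dd HH:mm:ss.

```python
import re
from datetime import datetime

# Timestamped log lines look like "19/11/06 04:56:47 ERROR <logger>: <message>".
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")

# Marker substrings copied from the logs above; the keys are hypothetical labels.
MARKERS = {
    "oom": "OutOfMemoryError in thread",
    "driver_restart": "driver starting up",
    "metastore_conn_failure": "Failed to acquire connection",
}


def scan_driver_log(lines):
    """Return sorted (timestamp, level, event) tuples for lines matching a marker."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            # Stack-trace frames and continuation lines carry no timestamp.
            continue
        ts_text, level, message = match.groups()
        for event, marker in MARKERS.items():
            if marker in message:
                # Assumed format: two-digit year, month, day, then time.
                ts = datetime.strptime(ts_text, "%y/%m/%d %H:%M:%S")
                events.append((ts, level, event))
                break
    return sorted(events)
```

Passing the lines of a downloaded driver log to `scan_driver_log` gives a chronological list of events, which makes it easier to see whether each driver restart is preceded by metastore connection failures.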
I'm happy to answer any questions.