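In my application I compare two datasets (a source table from Hive and a destination table from an RDBMS) for duplicates and mismatches. It works fine with smaller datasets, but when I try to compare more than 1 GB of data (the source alone), it hangs and throws a timeout error. I tried increasing the network timeout with .config("spark.network.timeout", "600s"), but even then it throws:
java.lang.OutOfMemoryError: GC overhead limit exceeded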
val spark = SparkSession.builder().master("local")
.appName("spark remote")
.config("javax.jdo.option.ConnectionURL", "jdbc:mysql://192.168.175.160:3306/metastore?useSSL=false")
.config("javax.jdo.option.ConnectionUserName", "hiveroot")
.config("javax.jdo.option.ConnectionPassword", "hivepassword")
.config("hive.exec.scratchdir", "/tmp/hive/${user.name}")
.config("hive.metastore.uris", "thrift://192.168.175.160:9083")
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
val source = spark.sql("SELECT * from sample.source").rdd.map(_.mkString(","))
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
val sparkdestination = SparkSession.builder().master("local").appName("Database")
.config("spark.network.timeout", "600s")
.getOrCreate()
val jdbcUsername = "root"
val jdbcPassword = "root"
val url = "jdbc:mysql://192.168.175.35:3306/sample?useSSL=false"
val connectionProperties = new java.util.Properties()
connectionProperties.put("user", jdbcUsername)
connectionProperties.put("password", jdbcPassword)
val queryDestination = "(select * from destination) as dest"
val destination = sparkdestination.read.jdbc(url, queryDestination, connectionProperties).rdd.map(_.mkString(","))
I also tried destination.persist(StorageLevel.MEMORY_AND_DISK_SER), as well as the MEMORY_AND_DISK and DISK_ONLY storage levels, but no luck.
EDIT: This is the original error stack:
17/07/11 12:49:43 INFO DAGScheduler: Submitting 22 missing tasks from ShuffleMapStage 1 (MapPartitionsRDD[13] at map at stack.scala:76)
17/07/11 12:49:43 INFO TaskSchedulerImpl: Adding task set 1.0 with 22 tasks
17/07/11 12:49:43 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/07/11 12:51:38 INFO JDBCRDD: closed connection
17/07/11 12:51:38 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2210)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1989)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3410)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3112)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2341)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2736)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2490)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:301)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
17/07/11 12:51:38 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2210)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1989)
EDIT 2:
I tried using:
val options = Map(
"url" -> "jdbc:mysql://192.168.175.35:3306/sample?useSSL=false",
"dbtable" -> queryDestination,
"user" -> "root",
"password" -> "root")
val destination = sparkdestination.read.options(options).jdbc(options("url"), options("dbtable"), "0", 1, 5, 4, new java.util.Properties()).rdd.map(_.mkString(","))
I checked this with a small amount of data and it works, but for a large dataset it throws this error:
17/07/11 14:12:46 INFO DAGScheduler: looking for newly runnable stages
17/07/11 14:12:46 INFO DAGScheduler: running: Set(ShuffleMapStage 1)
17/07/11 14:12:46 INFO DAGScheduler: waiting: Set(ResultStage 2)
17/07/11 14:12:46 INFO DAGScheduler: failed: Set()
17/07/11 14:12:50 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.175.160:39913 in memory (size: 19.9 KB, free: 353.4 MB)
17/07/11 14:14:47 WARN ServerConnector:
17/07/11 14:15:32 WARN QueuedThreadPool:
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.substring(String.java:1969)
17/07/11 14:15:32 ERROR Utils: uncaught error in thread Spark Context Cleaner, stopping SparkContext
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:179)
17/07/11 14:15:32 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver, [Lscala.Tuple2;@1e855db,BlockManagerId (driver, 192.168.175.160, 39913, None))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
17/07/11 14:15:32 ERROR Utils: throw uncaught fatal error in thread Spark Context Cleaner
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:179)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1245)
17/07/11 14:15:32 WARN QueuedThreadPool: Unexpected thread death: org.spark_project.jetty.util.thread.QueuedThreadPool$3@710104 in SparkUI{STARTED,8<=8<=200,i=5,q=0}
17/07/11 14:15:32 INFO JDBCRDD: closed connection
17/07/11 14:15:32 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 22)
java.lang.OutOfMemoryError: GC overhead limit exceeded
17/07/11 14:15:32 INFO SparkUI: Stopped Spark web UI at http://192.168.175.160:4040
17/07/11 14:15:32 INFO DAGScheduler: Job 0 failed: collect at stack.scala:93, took 294.365864 s
Exception in thread "main" org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:808)
17/07/11 14:15:32 INFO DAGScheduler: ShuffleMapStage 1 (map at stack.scala:85) failed in 294.165 s due to Stage cancelled because SparkContext was shut down
17/07/11 14:15:32 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@cfb906)
17/07/11 14:15:32 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1499762732342,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
17/07/11 14:15:32 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[Executor task launch worker-1,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
17/07/11 14:15:32 INFO DiskBlockManager: Shutdown hook called
17/07/11 14:15:32 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/07/11 14:15:32 INFO ShutdownHookManager: Shutdown hook called
17/07/11 14:15:32 INFO MemoryStore: MemoryStore cleared
17/07/11 14:15:32 INFO BlockManager: BlockManager stopped
17/07/11 14:15:32 INFO BlockManagerMaster: BlockManagerMaster stopped
17/07/11 14:15:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-0b2ea8bd-95c0-45e4-a1cc-bd62b3899b24
17/07/11 14:15:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-0b2ea8bd-95c0-45e4-a1cc-bd62b3899b24/userFiles-194d73ba-fcfa-4616-ae17-78b0bba6b465
17/07/11 14:15:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
I am running this in development mode with 2 GB of memory and 1 core. I am quite new to this, so sorry for such a naive question.
Thanks!
Answer 0 (score: 3)
First of all, you are starting two SparkSessions, which is pointless: you are just splitting resources. So don't do that!
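A minimal sketch of a single-session setup, reusing the connection details from the question. One SparkSession with Hive support can read both the Hive table and the JDBC table. The `local[*]` master and the application name are my assumptions (the question uses `local`), and the code is written script-style, as for spark-shell.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

// One session for both sources; enabling Hive support does not prevent JDBC reads.
val spark = SparkSession.builder()
  .master("local[*]") // assumption: use all local cores
  .appName("compare source and destination") // assumption: any name works
  .config("hive.metastore.uris", "thrift://192.168.175.160:9083")
  .config("spark.network.timeout", "600s")
  .enableHiveSupport()
  .getOrCreate()

val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "root")

val url = "jdbc:mysql://192.168.175.35:3306/sample?useSSL=false"

// Both reads go through the same session and share its resources.
val source = spark.sql("SELECT * FROM sample.source")
val destination = spark.read.jdbc(url, "destination", connectionProperties)
```

Note that this only removes the duplicated session. The JDBC read is still a single-partition read at this point.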
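Secondly, and this is where the problem is: there is a common misunderstanding about parallelism and the jdbc source in Apache Spark (don't worry, it's a gotcha!). This is mainly due to missing documentation (last time I checked).
Back to the problem. What actually happens is that the following line: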
val destination = spark.read.jdbc(url, queryDestination, connectionProperties).rdd.map(_.mkString(","))
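delegates the read to a single worker.
So if you had enough memory and managed to read the data, the whole destination dataset would end up in a single partition. And one partition means trouble, a.k.a. possibly: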
java.lang.OutOfMemoryError: GC overhead limit exceeded
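What happened is that the single executor chosen to fetch the data was overwhelmed and its JVM blew up.
Now let's solve this.
(Disclaimer: the following code is an excerpt from spark-gotchas, of which I am one of the authors.)
So let's create some sample data and save it in our database: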
val options = Map(
"url" -> "jdbc:postgresql://127.0.0.1:5432/spark",
"dbtable" -> "data",
"driver" -> "org.postgresql.Driver",
"user" -> "spark",
"password" -> "spark"
)
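// Imports needed outside spark-shell; assumes a SparkSession named `spark` is in scope.
import org.apache.spark.sql.functions.{current_timestamp, lit}
import spark.implicits._
val newData = spark.range(1000000)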
.select($"id", lit(""), lit(true), current_timestamp())
.toDF("id", "name", "valid", "ts")
newData.write.format("jdbc").options(options).mode("append").save
Apache Spark provides two methods for distributed data loading over JDBC. The first one partitions the data using an integer column:
val dfPartitionedWithRanges = spark.read.options(options)
.jdbc(options("url"), options("dbtable"), "id", 1, 5, 4, new java.util.Properties())
dfPartitionedWithRanges.rdd.partitions.size
// Int = 4
dfPartitionedWithRanges.rdd.glom.collect
// Array[Array[org.apache.spark.sql.Row]] = Array(
// Array([1,foo,true,2012-01-01 00:03:00.0]),
// Array([2,foo,false,2013-04-02 10:10:00.0]),
// Array([3,bar,true,2015-11-02 22:00:00.0]),
// Array([4,bar,false,2010-11-02 22:00:00.0]))
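To see why bounds of 1 and 5 with 4 partitions give one id per partition, it helps to look at the WHERE clause each partition ends up with. The function below is a simplified model of that splitting, written in plain Scala for illustration. It is not Spark's source code, `partitionClauses` is a hypothetical name, and it is restricted to non-negative bounds to keep the arithmetic simple.

```scala
// Simplified model: one WHERE clause per partition of a range-partitioned JDBC read.
def partitionClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  require(numPartitions >= 2, "a single partition needs no WHERE clause")
  require(lower >= 0 && upper - lower >= numPartitions, "bounds too narrow for this sketch")
  // Integer division on purpose: the stride is a whole number of ids.
  val stride = upper / numPartitions - lower / numPartitions
  (0 until numPartitions).map { i =>
    val start = lower + i * stride
    val end = start + stride
    if (i == 0) s"$column < $end OR $column IS NULL" // first partition is open below
    else if (i == numPartitions - 1) s"$column >= $start" // last partition is open above
    else s"$column >= $start AND $column < $end"
  }
}

partitionClauses("id", 1L, 5L, 4).foreach(println)
// id < 2 OR id IS NULL
// id >= 2 AND id < 3
// id >= 3 AND id < 4
// id >= 4
```

The bounds only decide the stride; they do not filter rows. Rows with ids outside the bounds still land in the first or last partition, so bounds of 1 and 5 on a large table would leave nearly everything in the last partition.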
The partition column and bounds can be provided using options as well:
val optionsWithBounds = options ++ Map(
"partitionColumn" -> "id",
"lowerBound" -> "1",
"upperBound" -> "5",
"numPartitions" -> "4"
)
spark.read.options(optionsWithBounds).format("jdbc").load
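Applied to the MySQL table from the question, a sketch could look like the following. It assumes the destination table has a numeric column named `id`, which the question does not confirm, and it picks 8 partitions arbitrarily. The bounds are fetched from the database first so that the partitions cover the real id range.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]") // assumption: local development mode
  .appName("partitioned jdbc read")
  .getOrCreate()

val url = "jdbc:mysql://192.168.175.35:3306/sample?useSSL=false"
val props = new Properties()
props.put("user", "root")
props.put("password", "root")

// Assumption: `destination` has a numeric, non-null column `id`.
// This small query returns a single row, so reading it in one partition is fine.
val bounds = spark.read
  .jdbc(url, "(select min(id) as lo, max(id) as hi from destination) as b", props)
  .first()
val lower = bounds.getAs[Number]("lo").longValue()
val upper = bounds.getAs[Number]("hi").longValue()

// Range-partitioned read: each task fetches only its own slice of ids.
val destination = spark.read.jdbc(url, "destination", "id", lower, upper, 8, props)
```

With a single core the partitions are processed one after another, but each task only has to hold its own slice of the table. If the table is empty, the min/max query returns nulls and this sketch fails, so guard for that in real code.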
Another option is to use a sequence of predicates, but I won't go into it here.
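For readers who want a starting point anyway, here is a small sketch of how such predicates could be built. Each predicate becomes the WHERE clause of one partition, so the predicates should not overlap and should cover every row between them. `modPredicates` is a hypothetical helper, and it assumes a non-negative integer column and a database that supports `MOD` (MySQL and PostgreSQL do).

```scala
// Builds one predicate per partition by bucketing ids with MOD.
def modPredicates(column: String, numPartitions: Int): Array[String] = {
  require(numPartitions >= 1, "need at least one partition")
  (0 until numPartitions).map { i =>
    val bucket = s"MOD($column, $numPartitions) = $i"
    // NULL ids match no bucket, so the first partition picks them up explicitly.
    if (i == 0) s"($bucket OR $column IS NULL)" else bucket
  }.toArray
}

modPredicates("id", 3).foreach(println)
// (MOD(id, 3) = 0 OR id IS NULL)
// MOD(id, 3) = 1
// MOD(id, 3) = 2
```

The resulting array can be passed to `spark.read.jdbc(url, table, predicates, connectionProperties)`, which creates one partition per predicate.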
You can read more about Spark SQL and the JDBC source here, along with some other gotchas.
I hope this helps.