Spark: a database connection per Spark RDD partition, and doing mapPartitions

Date: 2016-06-17 11:59:59

Tags: scala apache-spark rdd

I want to do a mapPartitions on my Spark RDD:

    val newRd = myRdd.mapPartitions(
      partition => {

        val connection = new DbConnection /*creates a db connection per partition*/

        val newPartition = partition.map(
          record => {
            readMatchingFromDB(record, connection)
          })
        connection.close()
        newPartition
      })

However, this gives me a connection-already-closed exception, as expected, because the connection is closed before control ever reaches .map(). I want to create one connection per RDD partition and close it properly. How can I achieve this?

Thanks!

2 Answers:

Answer 0 (score: 6)

As mentioned in the discussion here, the problem stems from the laziness of the map operation on the iterator partition: map defers its work until the iterator is actually consumed. Because of this laziness, for each partition the connection is created and immediately closed, and only later (when the resulting RDD is acted upon) is readMatchingFromDB called, by which point the connection is already gone.
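To see this laziness in isolation, here is a minimal plain-Scala sketch (no Spark needed; the values are made up for illustration):

    // Iterator.map is lazy: the body runs only when the iterator is consumed
    val it = Iterator(1, 2, 3).map { x =>
      println(s"processing $x")
      x * 2
    }
    println("iterator built") // printed first: nothing has been processed yet
    val doubled = it.toList   // only now do the "processing ..." lines print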

To fix this, force an eager traversal of the iterator before closing the connection, e.g. by converting it to a list (and then back to an iterator):

    val newRd = myRdd.mapPartitions(partition => {
      val connection = new DbConnection /*creates a db connection per partition*/

      val newPartition = partition.map(record => {
        readMatchingFromDB(record, connection)
      }).toList // consumes the iterator, thus calls readMatchingFromDB

      connection.close()
      newPartition.iterator // create a new iterator
    })
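One caveat: .toList materializes the whole partition in memory at once. If partitions are large, a sketch of an alternative (assuming Spark 2.4+, where TaskContext.addTaskCompletionListener takes a type parameter) keeps the mapping lazy and closes the connection only when the task finishes:

    import org.apache.spark.TaskContext

    val newRd = myRdd.mapPartitions { partition =>
      val connection = new DbConnection // one connection per partition, as before
      // Close the connection once the task has fully consumed the iterator
      TaskContext.get().addTaskCompletionListener[Unit](_ => connection.close())
      partition.map(record => readMatchingFromDB(record, connection))
    }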

Answer 1 (score: 0)

    rdd.foreachPartitionAsync(iterator -> {
        // This object is cached inside each executor JVM: the connection is
        // created on first use and reused from then on.
        // Very useful for streaming apps.
        DBConn conn = DBConn.getConnection();
        while (iterator.hasNext()) {
            conn.read(iterator.next()); // advance the iterator so the loop terminates
        }
    });

    public class DBConn {
        private static DBConn dbObj = null;

        // Singleton accessor: creates the connection once per executor JVM,
        // then returns the same instance on every subsequent call
        public static synchronized DBConn getConnection() {
            if (dbObj == null) {
                dbObj = new DBConn();
            }
            return dbObj;
        }
    }
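A usage note worth checking in your setup: multiple tasks can run concurrently inside one executor JVM, so a single shared connection must either be thread-safe or be replaced by a small connection pool. The synchronized accessor above only guards construction of the singleton, not concurrent use of the connection itself.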