Spark: a database connection per Spark RDD partition, and doing mapPartitions

Date: 2016-06-17 11:59:59

Tags: scala apache-spark rdd

I want to do a mapPartitions on my Spark RDD:

    val newRd = myRdd.mapPartitions(
      partition => {

        val connection = new DbConnection /*creates a db connection per partition*/

        val newPartition = partition.map(
          record => {
            readMatchingFromDB(record, connection)
          })
        connection.close()
        newPartition
      })

However, this gives me a connection-already-closed exception, as expected, because the connection is closed before control ever reaches .map(). I want to create one connection per RDD partition and close it properly. How can I achieve this?

Thanks!

2 Answers:

Answer 0 (score: 6)

As mentioned in the discussion here, the problem stems from the laziness of the map operation on the iterator partition: map defers its work until the iterator is actually consumed. Because of this laziness, for each partition the connection is created and immediately closed, and only later (when the resulting RDD is acted upon) is readMatchingFromDB called, by which point the connection is already gone.
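To see this laziness in isolation, here is a minimal plain-Scala sketch (no Spark needed; the values are made up for illustration):

    // Iterator.map is lazy: the body runs only when the iterator is consumed
    val it = Iterator(1, 2, 3).map { x =>
      println(s"processing $x")
      x * 2
    }
    println("iterator built") // printed first: nothing has been processed yet
    val doubled = it.toList   // only now do the "processing ..." lines print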

To fix this, force an eager traversal of the iterator before closing the connection, e.g. by converting it to a list (and then back to an iterator):

    val newRd = myRdd.mapPartitions(partition => {
      val connection = new DbConnection /*creates a db connection per partition*/

      val newPartition = partition.map(record => {
        readMatchingFromDB(record, connection)
      }).toList // consumes the iterator, thus calls readMatchingFromDB

      connection.close()
      newPartition.iterator // create a new iterator
    })
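One caveat: .toList materializes the whole partition in memory at once. If partitions are large, a sketch of an alternative (assuming Spark 2.4+, where TaskContext.addTaskCompletionListener takes a type parameter) keeps the mapping lazy and closes the connection only when the task finishes:

    import org.apache.spark.TaskContext

    val newRd = myRdd.mapPartitions { partition =>
      val connection = new DbConnection // one connection per partition, as before
      // Close the connection once the task has fully consumed the iterator
      TaskContext.get().addTaskCompletionListener[Unit](_ => connection.close())
      partition.map(record => readMatchingFromDB(record, connection))
    }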

Answer 1 (score: 0)

    rdd.foreachPartitionAsync(iterator -> {
        // This object is cached inside each executor JVM: the connection is
        // created on first use and reused from then on.
        // Very useful for streaming apps.
        DBConn conn = DBConn.getConnection();
        while (iterator.hasNext()) {
            conn.read(iterator.next()); // advance the iterator so the loop terminates
        }
    });

    public class DBConn {
        private static DBConn dbObj = null;

        // Singleton accessor: creates the connection once per executor JVM,
        // then returns the same instance on every subsequent call
        public static synchronized DBConn getConnection() {
            if (dbObj == null) {
                dbObj = new DBConn();
            }
            return dbObj;
        }
    }
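A usage note worth checking in your setup: multiple tasks can run concurrently inside one executor JVM, so a single shared connection must either be thread-safe or be replaced by a small connection pool. The synchronized accessor above only guards construction of the singleton, not concurrent use of the connection itself.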