I want to do a mapPartitions on my Spark RDD:
val newRd = myRdd.mapPartitions(
  partition => {
    val connection = new DbConnection /*creates a db connection per partition*/
    val newPartition = partition.map(
      record => {
        readMatchingFromDB(record, connection)
      })
    connection.close()
    newPartition
  })
However, this gives me a "connection already closed" exception, which is expected, because the connection is closed before control ever reaches .map(). I want to create one connection per RDD partition and close it properly. How can I do that?
Thanks!
Answer 0 (score: 6)
As mentioned in the discussion here, the problem stems from the laziness of the map operation on the iterator partition. That laziness means that for each partition the connection is created and immediately closed, and readMatchingFromDB is only invoked later, when the RDD is actually acted upon.
To fix this, force an eager traversal of the iterator before closing the connection, e.g. by converting it into a List (and then back):
val newRd = myRdd.mapPartitions(partition => {
  val connection = new DbConnection /*creates a db connection per partition*/
  val newPartition = partition.map(record => {
    readMatchingFromDB(record, connection)
  }).toList // consumes the iterator, thus calls readMatchingFromDB
  connection.close()
  newPartition.iterator // create a new iterator
})
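If materializing the whole partition with toList is a memory concern, a possible variation (a sketch, not part of the original answer; DbConnection and readMatchingFromDB are the question's placeholders) keeps the mapping lazy and only closes the connection once the iterator has been fully consumed, relying on the fact that Iterator.++ evaluates its right-hand operand by name:

val newRd = myRdd.mapPartitions { partition =>
  val connection = new DbConnection // one connection per partition, as before
  // Iterator.++ takes its argument by name, so the block below only runs after
  // the mapped iterator has been exhausted, i.e. after the last record was read.
  partition.map(record => readMatchingFromDB(record, connection)) ++ {
    connection.close()
    Iterator.empty
  }
}

The trade-off versus the eager toList version is that if the downstream operation does not consume the whole iterator, the connection may never be closed.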
Answer 1 (score: 0)
rdd.foreachPartitionAsync(iterator -> {
    // This object is cached inside each executor JVM: the connection is created
    // the first time and reused from then on. Very useful for streaming apps.
    DBConn conn = DBConn.getConnection();
    while (iterator.hasNext()) {
        iterator.next(); // advance the iterator (the original loop never did this)
        conn.read();
    }
});
public class DBConn {
    private static DBConn dbObj = null;
    // Singleton accessor: create the instance once per executor JVM, then reuse it
    public static synchronized DBConn getConnection() {
        if (dbObj == null) dbObj = new DBConn();
        return dbObj;
    }
    public void read() { /* read from the DB using the cached connection */ }
}
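The same per-executor caching idea can be sketched in Scala (a rough sketch under the question's assumptions: DbConnection and readMatchingFromDB are the question's placeholders, and ExecutorDbConn, readAll and Record are hypothetical names standing in for your own types):

import org.apache.spark.rdd.RDD

// One connection per executor JVM: an `object` with a `lazy val` gives a
// thread-safe, lazily created singleton reused by every partition on this executor.
object ExecutorDbConn {
  lazy val connection: DbConnection = new DbConnection
}

def readAll(myRdd: RDD[Record]): Unit =
  myRdd.foreachPartition { iterator =>
    val conn = ExecutorDbConn.connection // created on first use, then reused
    iterator.foreach(record => readMatchingFromDB(record, conn))
  }

As in the Java version, the connection is never closed explicitly but lives for the lifetime of the executor; if several tasks run in parallel on one executor, the shared connection must itself be thread-safe (or replaced by a small pool).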