I have a Spark (1.2.1) job that inserts the contents of an RDD into Postgres using org.postgresql.Driver for Scala:
rdd.foreachPartition(iter => {
  // connect to the Postgres database on localhost
  val driver = "org.postgresql.Driver"
  Class.forName(driver)
  val connection: Connection = DriverManager.getConnection(url, username, password)
  try {
    val statement = connection.createStatement()
    iter.foreach(row => {
      val mapRequest = Utils.getInsertMap(row)
      val query = Utils.getInsertRequest(squares_table, mapRequest)
      try { statement.execute(query) }
      catch {
        case pe: PSQLException => println("exception caught: " + pe)
      }
    })
  } finally {
    connection.close() // release the connection even if an insert fails
  }
})
In the code above, I open a new connection to Postgres for every partition of the RDD and close it afterwards. I think the right approach is to use a connection pool for Postgres that I can take connections from (as described here), but that is only pseudo-code:
rdd.foreachPartition { partitionOfRecords =>
  // ConnectionPool is a static, lazily initialized pool of connections
  val connection = ConnectionPool.getConnection()
  partitionOfRecords.foreach(record => connection.send(record))
  ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
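For concreteness, one way to flesh that pseudo-code out is a singleton object wrapping a pooling library. Below is a minimal sketch assuming HikariCP is on the classpath; the ConnectionPool name, URL, and credentials are illustrative, not from the original post:

import java.sql.Connection
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}

// Hypothetical pool: one lazily initialized HikariCP data source per executor JVM
object ConnectionPool {
  private lazy val dataSource: HikariDataSource = {
    val config = new HikariConfig()
    config.setJdbcUrl("jdbc:postgresql://localhost:5432/databaseName") // illustrative URL
    config.setUsername("username")
    config.setPassword("password")
    config.setMaximumPoolSize(10)
    new HikariDataSource(config)
  }

  def getConnection(): Connection = dataSource.getConnection

  // Closing a pooled connection returns it to the pool rather than tearing it down
  def returnConnection(connection: Connection): Unit = connection.close()
}

Because the object is initialized lazily on each executor, the pool is created once per JVM rather than once per partition.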
What is the right way to connect to Postgres from Spark using a connection pool?
Answer 0 (score: 0)
This code works with Spark 2 or greater versions and Scala. First you have to add the PostgreSQL JDBC driver.
If you are using Maven, add this dependency to your pom file:
<dependency>
    <groupId>postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>9.1-901-1.jdbc4</version>
</dependency>
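Since this is a Scala project, it may use sbt instead of Maven; in that case the equivalent line (an assumption, using the same coordinates as above) would be:

// sbt equivalent of the Maven dependency above (hypothetical build setup)
libraryDependencies += "postgresql" % "postgresql" % "9.1-901-1.jdbc4"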
Then write this code into a Scala file:
import org.apache.spark.sql.SparkSession

object PostgresConnection {
  def main(args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-Basic")
        .master("local[4]")
        .getOrCreate()

    // JDBC connection properties for Postgres
    val prop = new java.util.Properties
    prop.setProperty("driver", "org.postgresql.Driver")
    prop.setProperty("user", "username")
    prop.setProperty("password", "password")

    val url = "jdbc:postgresql://127.0.0.1:5432/databaseName"

    // read the table into a DataFrame over JDBC
    val df = spark.read.jdbc(url, "table_name", prop)
    df.show(5) // show() prints directly and returns Unit, so println is not needed
  }
}
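The question is about inserting an RDD, so for completeness here is the matching write path with the same properties (a sketch; the target table name and save mode are assumptions). Spark's JDBC writer manages connections per partition internally, so no hand-rolled pool is needed on this path:

import org.apache.spark.sql.SaveMode

// append the DataFrame's rows to an existing Postgres table over JDBC
df.write.mode(SaveMode.Append).jdbc(url, "target_table", prop)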