In my Spark job, I use JDBC batching to insert records into MySQL, but I noticed that not all of the records make it into MySQL. For example:
// count records before insert
println(s"dataframe: ${dataframe.count()}")
dataframe.foreachPartition(partition => {
  Class.forName(jdbcDriver)
  val dbConnection: Connection = DriverManager.getConnection(jdbcUrl, username, password)
  var preparedStatement: PreparedStatement = null
  dbConnection.setAutoCommit(false)
  val batchSize = 100
  partition.grouped(batchSize).foreach(batch => {
    batch.foreach(row => {
      val productName = row.getString(row.fieldIndex("productName"))
      val quantity = row.getLong(row.fieldIndex("quantity"))
      val sqlString =
        s"""
          |INSERT INTO myDb.product (productName, quantity)
          |VALUES (?, ?)
        """.stripMargin
      preparedStatement = dbConnection.prepareStatement(sqlString)
      preparedStatement.setString(1, productName)
      preparedStatement.setLong(2, quantity)
      preparedStatement.addBatch()
    })
    preparedStatement.executeBatch()
    dbConnection.commit()
    preparedStatement.close()
  })
  dbConnection.close()
})
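dataframe.count reports 650 records, but when I check MySQL I see only 195. This is deterministic: I tried different batch sizes and still got the same number. However, when I move preparedStatement.executeBatch() inside batch.foreach(), onto the line right after preparedStatement.addBatch(), I see all 650 records in MySQL. Of course, that no longer batches the insert statements, because each one is executed immediately after it is added, within the same iteration. What could be preventing the queries from being batched?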
Answer 0 (score: 2)
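It looks like you create a new preparedStatement on every iteration. Each call to preparedStatement.executeBatch() therefore runs only the most recently created statement, which holds just the last row of that group, so you end up with 195 records instead of 650. Instead, create a single PreparedStatement once and only replace its parameters inside the loop, like this: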
import java.sql.{Connection, DriverManager, PreparedStatement}

dataframe.foreachPartition(partition => {
  Class.forName(jdbcDriver)
  val dbConnection: Connection = DriverManager.getConnection(jdbcUrl, username, password)
  dbConnection.setAutoCommit(false)
  val sqlString =
    s"""
      |INSERT INTO myDb.product (productName, quantity)
      |VALUES (?, ?)
    """.stripMargin
  // Prepare the statement once per partition and reuse it for every row
  val preparedStatement: PreparedStatement = dbConnection.prepareStatement(sqlString)
  val batchSize = 100
  partition.grouped(batchSize).foreach(batch => {
    batch.foreach(row => {
      val productName = row.getString(row.fieldIndex("productName"))
      val quantity = row.getLong(row.fieldIndex("quantity"))
      // Only the parameters change; do not call prepareStatement again here
      preparedStatement.setString(1, productName)
      preparedStatement.setLong(2, quantity)
      preparedStatement.addBatch()
    })
    // Send the queued rows of this group, then commit them
    preparedStatement.executeBatch()
    dbConnection.commit()
  })
  // Close only after every group has been written
  preparedStatement.close()
  dbConnection.close()
})
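
To see why re-creating the statement loses rows, the effect can be reproduced without a database. The sketch below is only an illustration: FakeStatement, BatchDemo and the 250 sample rows are hypothetical stand-ins, not part of JDBC or Spark. It assumes, as with a real PreparedStatement, that rows queued with addBatch() belong to the one statement object they were added to.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical in-memory stand-in for a JDBC PreparedStatement: it remembers
// only the rows queued on *this* instance and flushes them into `table`.
final class FakeStatement(table: ArrayBuffer[(String, Long)]) {
  private val pending = ArrayBuffer.empty[(String, Long)]
  private var current: (String, Long) = ("", 0L)

  def setRow(name: String, quantity: Long): Unit = { current = (name, quantity) }
  def addBatch(): Unit = { pending += current }
  def executeBatch(): Unit = { table ++= pending; pending.clear() }
}

object BatchDemo {
  // Made-up sample data standing in for one partition of the DataFrame.
  val rows: Seq[(String, Long)] = (1 to 250).map(i => (s"product$i", i.toLong))

  // Mirrors the question: a new statement per row, executeBatch once per group.
  def insertWithNewStatementPerRow(batchSize: Int): Int = {
    val table = ArrayBuffer.empty[(String, Long)]
    rows.grouped(batchSize).foreach { batch =>
      var stmt: FakeStatement = null
      batch.foreach { case (name, qty) =>
        stmt = new FakeStatement(table) // earlier statements and their queued rows are dropped
        stmt.setRow(name, qty)
        stmt.addBatch()
      }
      stmt.executeBatch() // flushes only the last row of the group
    }
    table.size
  }

  // Mirrors the fix: one statement reused for every row.
  def insertWithSharedStatement(batchSize: Int): Int = {
    val table = ArrayBuffer.empty[(String, Long)]
    val stmt = new FakeStatement(table)
    rows.grouped(batchSize).foreach { batch =>
      batch.foreach { case (name, qty) =>
        stmt.setRow(name, qty)
        stmt.addBatch()
      }
      stmt.executeBatch() // flushes every row queued for this group
    }
    table.size
  }

  def main(args: Array[String]): Unit = {
    println(s"new statement per row: ${insertWithNewStatementPerRow(100)}")
    println(s"shared statement:      ${insertWithSharedStatement(100)}")
  }
}
```

With 250 rows and a batch size of 100 there are three groups, so the first variant stores 3 rows and the second stores all 250. By the same reasoning, the 195 rows in the question would correspond to the number of groups processed across all partitions.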