Spark jdbc batch insert not inserting all records

Time: 2021-07-09 16:17:14

Tags: mysql scala apache-spark jdbc

In my Spark job I insert records into MySQL using JDBC batching, but I noticed that not all of the records end up in MySQL. For example:

// count records before insert
println(s"dataframe: ${dataframe.count()}")

dataframe.foreachPartition(partition => {

  Class.forName(jdbcDriver)
  val dbConnection: Connection = DriverManager.getConnection(jdbcUrl, username, password)

  var preparedStatement: PreparedStatement = null
  dbConnection.setAutoCommit(false)
  val batchSize = 100

  partition.grouped(batchSize).foreach(batch => {
    batch.foreach(row => {
      val productName = row.getString(row.fieldIndex("productName"))
      val quantity = row.getLong(row.fieldIndex("quantity"))
      val sqlString =
        s"""
           |INSERT INTO myDb.product (productName, quantity)
           |VALUES (?, ?)
          """.stripMargin

      preparedStatement = dbConnection.prepareStatement(sqlString)
      preparedStatement.setString(1, productName)
      preparedStatement.setLong(2, quantity)

      preparedStatement.addBatch()
    })

    preparedStatement.executeBatch()
    dbConnection.commit()
    preparedStatement.close()
  })
  dbConnection.close()
})

dataframe.count() reports 650 records, but when I check MySQL I only see 195. This is deterministic: I tried different batch sizes and still get the same numbers. However, when I move preparedStatement.executeBatch() inside batch.foreach(), i.e. onto the line immediately after preparedStatement.addBatch(), I do see the full 650 records in MySQL... but then the insert statements are no longer batched, since each one is executed right after being added within a single iteration. What could be preventing the queries from being batched?

1 Answer:

Answer 0 (score: 2):

It looks like you are creating a new preparedStatement on every iteration, so preparedStatement.executeBatch() only executes the rows added to the most recently created statement, and the rest are silently dropped — hence 195 records instead of 650. Instead, create a single PreparedStatement once and only rebind its parameters inside the loop, like this:

dataframe.foreachPartition(partition => {

  Class.forName(jdbcDriver)
  val dbConnection: Connection = DriverManager.getConnection(jdbcUrl, username, password)

  val sqlString =
    s"""
       |INSERT INTO myDb.product (productName, quantity)
       |VALUES (?, ?)
      """.stripMargin

  // create the PreparedStatement once and reuse it for every row
  val preparedStatement: PreparedStatement = dbConnection.prepareStatement(sqlString)

  dbConnection.setAutoCommit(false)
  val batchSize = 100

  partition.grouped(batchSize).foreach(batch => {
    batch.foreach(row => {
      val productName = row.getString(row.fieldIndex("productName"))
      val quantity = row.getLong(row.fieldIndex("quantity"))

      // reuse the same statement; only the bound parameters change per row
      preparedStatement.setString(1, productName)
      preparedStatement.setLong(2, quantity)

      preparedStatement.addBatch()
    })

    // executeBatch() clears the statement's batch, so it can be reused for the next group
    preparedStatement.executeBatch()
    dbConnection.commit()
  })
  preparedStatement.close()
  dbConnection.close()
})
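
As a side note, if manual connection handling is not a requirement, Spark's built-in JDBC data source performs batched inserts itself and exposes the batch size through the batchsize option. A minimal sketch, assuming the same jdbcUrl, username, and password values as in the question:

  // Sketch: let Spark's JDBC writer manage connections and batching.
  dataframe.write
    .format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", "myDb.product")
    .option("user", username)
    .option("password", password)
    .option("batchsize", "100")   // rows sent per executeBatch() call
    .mode("append")               // insert without touching existing rows
    .save()

This avoids the statement-lifecycle bug entirely, at the cost of less control over commit boundaries.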