Question

Does Apache Spark SQL support a MERGE clause similar to Oracle's MERGE SQL clause?

MERGE INTO <table> USING (
  SELECT * FROM <table1>
) ON (...)
  WHEN MATCHED THEN UPDATE...
    DELETE WHERE...
  WHEN NOT MATCHED THEN INSERT...

Answer 1

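No. As of now (this may change in the future), Spark does not support UPDATE, DELETE, or any other kind of record modification.

It can only overwrite existing storage (implemented differently depending on the source) or append with a plain INSERT.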
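
With only overwrite and append available, an upsert can be approximated at the DataFrame level: keep the target rows that have no match in the updates, add every update row, and write the result out in overwrite mode. The following is a minimal sketch of that idea, not a built-in Spark feature. The `upsert` helper, the `id` key column, the sample data, and the output path are all hypothetical, and the sketch does not cover the DELETE WHERE branch of an Oracle MERGE.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UpsertSketch {
  // Hypothetical helper: emulates MERGE (update matched, insert unmatched)
  // with a left anti join followed by a union.
  def upsert(target: DataFrame, updates: DataFrame, keys: Seq[String]): DataFrame = {
    // Target rows whose key does not appear in the updates stay unchanged.
    val untouched = target.join(updates, keys, "left_anti")
    // Every update row either replaces a matched row or is inserted as new.
    untouched.unionByName(updates)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("upsert-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, for illustration only.
    val target = Seq((1L, "old"), (2L, "keep")).toDF("id", "data")
    val updates = Seq((1L, "new"), (3L, "added")).toDF("id", "data")

    val merged = upsert(target, updates, Seq("id"))
    merged.orderBy("id").show()

    // Write to a different location than the one the target was read from;
    // Spark cannot overwrite a path it is still reading. The path is an assumption.
    // merged.write.mode("overwrite").parquet("/data/events_merged")

    spark.stop()
  }
}
```

Both DataFrames must have the same column names for `unionByName` to work, and the whole target is rewritten on every run, so this is only reasonable for modestly sized tables.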

Answer 2

If you are using Spark, this answer may help you handle the merge with DataFrames.

In any case, according to some Hortonworks documentation, Apache Hive 0.14 and later supports the MERGE statement.

Answer 3

It does, with Delta Lake as the storage format: df.write.format("delta").save("/data/events").

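import io.delta.tables._

// Upsert updatesDF into the Delta table at /data/events/, matching on eventId.
DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  // Matched rows: overwrite data with the incoming value.
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
  // Unmatched rows: insert them as new events.
  .whenNotMatched
  .insertExpr(
    Map(
      "date" -> "updates.date",
      "eventId" -> "updates.eventId",
      "data" -> "updates.data"))
  .execute()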

You also need the Delta package:

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.11</artifactId>
  <version>xxxx</version>
</dependency>

For more details, see https://docs.delta.io/0.4.0/delta-update.html

Answer 4

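You can write your own custom code. The code below performs an insert, and you can edit it to do a merge instead. Keep in mind that this is a compute-heavy operation, but...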

import java.sql.{Connection, DriverManager, PreparedStatement}

// Assumes df is a typed Dataset whose rows expose id, fname, lname, and userid,
// and brConnect is a broadcast java.util.Properties holding the JDBC settings.
df.rdd.coalesce(2).foreachPartition { partition =>
  val connectionProperties = brConnect.value
  val jdbcUrl = connectionProperties.getProperty("jdbcurl")
  val user = connectionProperties.getProperty("user")
  val password = connectionProperties.getProperty("password")
  val driver = connectionProperties.getProperty("Driver")
  Class.forName(driver)

  val dbc: Connection = DriverManager.getConnection(jdbcUrl, user, password)
  dbc.setAutoCommit(false) // commit explicitly, once per batch
  val dbBatchSize = 1000

  // Replace this INSERT with your database's MERGE statement to merge instead.
  val sqlString = "INSERT INTO employee VALUES (?, ?, ?, ?)"
  // Prepare the statement once per partition and reuse it for every row.
  val pstmt: PreparedStatement = dbc.prepareStatement(sqlString)

  try {
    partition.grouped(dbBatchSize).foreach { batch =>
      batch.foreach { row =>
        pstmt.setLong(1, row.id)
        pstmt.setString(2, row.fname)
        pstmt.setString(3, row.lname)
        pstmt.setString(4, row.userid)
        pstmt.addBatch()
      }
      // Send the whole batch in one round trip, then commit it.
      pstmt.executeBatch()
      dbc.commit()
    }
  } finally {
    pstmt.close()
    dbc.close()
  }
}
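
To turn the insert above into a merge, the main thing that changes is the statement text. The sketch below builds an Oracle-style MERGE statement with one placeholder per column, in the same order the answer binds them (id, fname, lname, userid). The helper name `buildMergeSql` is hypothetical, the column names are guessed from the row fields used in the answer, and MERGE syntax differs between databases, so treat this as a starting point rather than a drop-in statement.

```scala
object MergeSqlSketch {
  // Builds an Oracle-style MERGE with one "?" placeholder per column.
  // keyCols identify a matching row; valueCols are updated on a match.
  def buildMergeSql(table: String, keyCols: Seq[String], valueCols: Seq[String]): String = {
    val allCols = keyCols ++ valueCols
    // One-row source built from the bound parameters (Oracle's dual table).
    val source = allCols.map(c => s"? AS $c").mkString(", ")
    val onClause = keyCols.map(c => s"t.$c = s.$c").mkString(" AND ")
    val setClause = valueCols.map(c => s"t.$c = s.$c").mkString(", ")
    val insertCols = allCols.mkString(", ")
    val insertVals = allCols.map(c => s"s.$c").mkString(", ")
    s"MERGE INTO $table t " +
      s"USING (SELECT $source FROM dual) s " +
      s"ON ($onClause) " +
      s"WHEN MATCHED THEN UPDATE SET $setClause " +
      s"WHEN NOT MATCHED THEN INSERT ($insertCols) VALUES ($insertVals)"
  }

  def main(args: Array[String]): Unit = {
    // Column names are assumptions taken from the fields read in the answer.
    val sql = buildMergeSql("employee", Seq("id"), Seq("fname", "lname", "userid"))
    println(sql)
  }
}
```

The resulting string could be passed to `dbc.prepareStatement` in place of the INSERT text, with the four parameters bound exactly as before.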

Answer 5

Starting with Spark 3.0, Spark offers a very clean way to perform merge operations using Delta tables. https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge
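
For the SQL form the question asks about, Delta Lake on Spark 3.0 accepts a MERGE INTO statement through `spark.sql`. The following is a minimal sketch, assuming the Delta package is on the classpath and that Delta tables named `events` and `updates` already exist; the table and column names are hypothetical and simply mirror the example in Answer 3.

```scala
import org.apache.spark.sql.SparkSession

object DeltaMergeSqlSketch {
  // SQL counterpart of the DeltaTable.merge call: update data on a match,
  // insert the full row otherwise.
  val mergeSql: String =
    """MERGE INTO events
      |USING updates
      |ON events.eventId = updates.eventId
      |WHEN MATCHED THEN
      |  UPDATE SET events.data = updates.data
      |WHEN NOT MATCHED THEN
      |  INSERT (date, eventId, data)
      |  VALUES (updates.date, updates.eventId, updates.data)""".stripMargin

  def main(args: Array[String]): Unit = {
    // These two settings enable Delta's SQL support on Spark 3.x.
    val spark = SparkSession.builder()
      .appName("delta-merge-sql-sketch")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Assumes the events and updates tables are already registered.
    spark.sql(mergeSql)
    spark.stop()
  }
}
```

This only works when the target is a Delta table; running the same statement against a plain Parquet or Hive table is not expected to succeed.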
