How does Spark handle failure scenarios involving the JDBC data source?

Date: 2019-01-09 21:46:40

Tags: scala apache-spark jdbc apache-spark-sql

I’m writing a data source that shares similarities with Spark’s JDBC data source implementation, and I’d like to ask how Spark handles certain failure scenarios. To my understanding, if an executor dies while it’s running a task, Spark will revive the executor and try to re-run that task. However, how does this play out in the context of data integrity and Spark’s JDBC data source API (e.g. df.write.format("jdbc").option(...).save())?

In the savePartition function of JdbcUtils.scala, we see Spark calling the commit and rollback functions of the Java connection object generated from the database url/credentials provided by the user (see below). But if an executor dies right after commit() finishes or before rollback() is called, does Spark try to re-run the task and write the same data partition again, essentially creating duplicate committed rows in the database? And what happens if the executor dies in the middle of calling commit() or rollback()?

try {
    ...
    if (supportsTransactions) {
        conn.commit()
    }
    committed = true
    Iterator.empty
} catch {
    case e: SQLException =>
        ...
        throw e
} finally {
    if (!committed) {
        // The stage must fail.  We got here through an exception path, so
        // let the exception through unless rollback() or close() want to
        // tell the user about another problem.
        if (supportsTransactions) {
          conn.rollback()
        }
        conn.close()
    } else {
        ...
    }
}

2 Answers:

Answer 0 (score: 0)

I had to introduce some de-duplication logic for exactly the reasons described. You might indeed end up with the same data committed twice (or more).
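For what it's worth, here is a minimal sketch of the kind of de-duplication I mean (not the code I actually used), assuming a hypothetical target table my_table with an extra batch_part column: a retried task first deletes whatever its earlier attempt may have committed for its partition, and the delete plus insert commit as a single transaction.

// Hedged sketch: idempotent JDBC writes by tagging every row with a
// deterministic partition id (in practice you would also add a run/batch id),
// so a retried task replaces whatever its earlier attempt committed.
import java.sql.DriverManager
import org.apache.spark.TaskContext
import org.apache.spark.sql.{DataFrame, Row}

def writeIdempotent(df: DataFrame, url: String, user: String, password: String): Unit = {
  df.rdd.foreachPartition { rows: Iterator[Row] =>
    val partId = TaskContext.get().partitionId() // stable across task retries
    val conn = DriverManager.getConnection(url, user, password)
    conn.setAutoCommit(false)
    try {
      // Remove anything a previous, possibly committed, attempt left behind.
      val del = conn.prepareStatement("DELETE FROM my_table WHERE batch_part = ?")
      del.setInt(1, partId)
      del.executeUpdate()

      // Hypothetical schema: (id BIGINT, value TEXT, batch_part INT).
      val ins = conn.prepareStatement(
        "INSERT INTO my_table (id, value, batch_part) VALUES (?, ?, ?)")
      rows.foreach { row =>
        ins.setLong(1, row.getAs[Long]("id"))
        ins.setString(2, row.getAs[String]("value"))
        ins.setInt(3, partId)
        ins.addBatch()
      }
      ins.executeBatch()

      // Delete + insert commit atomically: a crash before this line leaves the
      // table as it was, and a retry starts from a clean slate.
      conn.commit()
    } catch {
      case e: Throwable =>
        conn.rollback()
        throw e
    } finally {
      conn.close()
    }
  }
}

An upsert (MERGE) keyed on a natural key achieves the same effect, if the target database supports it.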

Answer 1 (score: 0)

"However, if an executor dies right after commit() finishes or before rollback() is called, does Spark try to re-run the task and write the same data partition again, essentially creating duplicate committed rows in the database?"

Since Spark SQL (a high-level API over the RDD API) does not really know much about the peculiarities of JDBC, or of any other protocol, what would you expect? Not to mention the underlying execution runtime, i.e. Spark Core.

When you write a structured query such as df.write.format("jdbc").option(...).save(), Spark SQL translates it into a distributed computation using the low-level, assembly-like RDD API. Because it tries to embrace as many "protocols" as possible (including JDBC), Spark SQL's DataSource API leaves most of the error handling to the data source itself.
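For reference, this is the kind of structured JDBC write being discussed (the connection values below are placeholders):

df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/mydb")
  .option("dbtable", "my_table")
  .option("user", "user")
  .option("password", "password")
  .mode("append")
  .save()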

Spark Core, which schedules the tasks (and neither knows nor cares what they do), simply monitors execution, and if a task fails it will try to run it again (by default, up to 3 retries).
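As a side note, that retry count is governed by the spark.task.maxFailures setting (4 by default, i.e. the first attempt plus 3 retries); a minimal sketch of setting it explicitly:

// Sketch only: spark.task.maxFailures controls how many times a single task
// may fail before the whole stage (and job) is failed.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("jdbc-write")
  .config("spark.task.maxFailures", "4")
  .getOrCreate()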

So when you write a custom data source, you know the drill, and you have to deal with such retries in your own code.

One way to handle errors is to register task listeners using TaskContext (e.g. addTaskCompletionListener or addTaskFailureListener).
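A minimal sketch of that approach, assuming it runs inside code already executing on an executor (for example inside a foreachPartition body) with an open JDBC connection:

// Sketch: registering listeners on the current task via TaskContext.
import java.sql.Connection
import org.apache.spark.TaskContext
import org.apache.spark.util.{TaskCompletionListener, TaskFailureListener}

def registerCleanup(conn: Connection): Unit = {
  val ctx = TaskContext.get()

  // Invoked only when the task fails; a natural place to roll back.
  ctx.addTaskFailureListener(new TaskFailureListener {
    override def onTaskFailure(context: TaskContext, error: Throwable): Unit = {
      conn.rollback()
    }
  })

  // Invoked when the task completes, whether it succeeded or failed; a natural
  // place to release resources.
  ctx.addTaskCompletionListener(new TaskCompletionListener {
    override def onTaskCompletion(context: TaskContext): Unit = {
      conn.close()
    }
  })
}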