Question

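I have a question about sequential processing in a Spark batch job. Here is a simplified version of the problem I am trying to get an answer to.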

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Simple Dataframe Processing")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

val df = spark.read.json("devices.json")

// Displays the content of the DataFrame to stdout
df.show()

// +------------+------------+
// | device-guid| Operation  |
// +------------+------------+
// | 1234       | Add 3      |
// | 1234       | Sub 3      |
// | 1234       | Add 2      |
// | 1234       | Sub 2      |
// | 1234       | Add 1      |
// | 1234       | Sub 1      |
// +------------+------------+


// I have a database with one table with the following columns:
//   device-guid (primary key)   result


// I would like to take df and, for each row in it, apply an update to a single DB row,
// adding or subtracting the number described in the Operation column.
// So the result I am expecting at the end of this in the DB is a single row with:

// device-guid      result
// 1234             0


df.foreach { row =>
  UpdateDB(row)  // Update the DB with the row's Operation.
                 // Actual method not shown
}

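Let's say I am running this on a Spark cluster with YARN, with 5 executors of 2 cores each across 5 worker nodes. What in Spark guarantees that the UpdateDB operation is scheduled and executed in the order of the rows in the dataframe, and is NOT EVER scheduled and executed in parallel?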

That is, I always want to end up with 0 in the result column of my DB.

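The question in the larger sense is: "What guarantees sequential processing of operations on a dataframe, even with multiple executors and cores?"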

Can you point me to Spark documentation showing that these tasks will be processed in sequence?

Are there any Spark properties that need to be set for this to work?

Regards,

Venkat

Answer 1

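The question in the larger sense is: "What guarantees sequential processing of operations on a dataframe, even with multiple executors and cores?"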

Nothing, short of having no parallelism at all, or having only a single partition.

A single core might achieve a similar result, but it does not guarantee a specific order of chunks.

If you really need sequential processing, then you are using the wrong tool for the job.
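
For illustration, the single-partition option mentioned above could look roughly like the sketch below. It is only a sketch under assumptions: parseDelta is a hypothetical stand-in for the question's UpdateDB (which is not shown), the Operation column is assumed to hold strings such as "Add 3" or "Sub 3", and the Spark master is assumed to be supplied by spark-submit. Collapsing to one partition means a single task walks the rows one at a time, which gives up Spark's parallelism entirely; it does not by itself promise that the rows arrive in the order they appear in the source file.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object SequentialUpdateSketch {
  // Hypothetical stand-in for the question's UpdateDB: turns "Add 3" / "Sub 3"
  // into a signed delta. A real implementation would issue the DB update here.
  def parseDelta(operation: String): Int = operation.trim.split("\\s+") match {
    case Array("Add", n) => n.toInt
    case Array("Sub", n) => -n.toInt
    case _ => throw new IllegalArgumentException(s"Unrecognized operation: $operation")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: the master URL is provided externally (e.g. by spark-submit).
    val spark = SparkSession.builder()
      .appName("Sequential update sketch")
      .getOrCreate()

    val df = spark.read.json("devices.json")

    // coalesce(1) merges everything into one partition, so exactly one task
    // processes the rows, one after another, with no parallel updates.
    df.coalesce(1).foreachPartition { (rows: Iterator[Row]) =>
      rows.foreach { row =>
        val delta = parseDelta(row.getAs[String]("Operation"))
        // Placeholder for the actual DB call.
        println(s"${row.getAs[Any]("device-guid")} -> $delta")
      }
    }

    spark.stop()
  }
}
```

The explicit Iterator[Row] parameter type is there to pick the Scala overload of foreachPartition unambiguously. With one partition, the whole job runs at the speed of a single core, which is the trade-off the answer is pointing at.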
