Question

I am reading the book "Spark: The Definitive Guide", which contains the following example.

val myRange = spark.range(1000).toDF("number")
val divisBy2 = myRange.where("number % 2 = 0")
divisBy2.count()

Below is the book's description of these three lines of code.

we started a Spark job that runs our filter transformation (a narrow
transformation), then an aggregation (a wide transformation) that performs the counts on a per
partition basis, and then a collect, which brings our result to a native object in the respective
language

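I know that count is an action rather than a transformation, because it returns an actual value, and I cannot call explain on the value that count returns.

What I want to know is why count causes a wide transformation, and how I can see the execution plan for this count, given that I cannot call explain after count.

Thanks.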

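Update:

This image is a Spark UI screenshot that I took from a Databricks notebook. It shows a shuffle write and a shuffle read. Does that mean a wide transformation is taking place?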

Answer 1

Here is the execution plan:

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#7L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#10L])
      +- *(1) Project
         +- *(1) Filter ((id#0L % 2) = 0)
            +- *(1) Range (0, 1000, step=1, splits=8)
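
One way to obtain a plan like the one above, sketched under some assumptions: count() returns a Long, so explain cannot be called on it, but groupBy() with no keys followed by count() is a transformation that returns a DataFrame, and to my knowledge Dataset.count() is built on that same aggregation internally. The object name CountPlan, the application name, and the local master setting are placeholders for illustration; in a notebook the existing spark session can be used instead.

```scala
import org.apache.spark.sql.SparkSession

object CountPlan {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a notebook already provides `spark`.
    val spark = SparkSession.builder()
      .appName("count-plan")
      .master("local[*]")
      .getOrCreate()

    val myRange = spark.range(1000).toDF("number")
    val divisBy2 = myRange.where("number % 2 = 0")

    // groupBy() with no grouping keys plus count() yields a one-row DataFrame,
    // so its physical plan can be printed without running an action.
    divisBy2.groupBy().count().explain()

    spark.stop()
  }
}
```

The exact operator names and expression ids in the printed plan can differ between Spark versions.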

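What the plan shows:

A partial count is computed within each partition (partial_count)
All partitions are merged into a single partition (Exchange SinglePartition), which is the shuffle
The final count is computed from the partial counts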
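
The three steps can be imitated in plain Scala without Spark. This is only a rough sketch of the idea, not how Spark implements it: the object TwoPhaseCount and its members are hypothetical names, and the split into 8 chunks of 125 rows is an assumption that mirrors splits=8 in the plan.

```scala
object TwoPhaseCount {
  // Stand-in for the 8 partitions of Range(0, 1000): 8 chunks of 125 rows each.
  val partitions: Vector[Vector[Long]] =
    (0L until 1000L).toVector.grouped(125).toVector

  // Step 1: each partition filters and counts its own rows (the partial_count).
  def partialCounts(parts: Seq[Seq[Long]]): Seq[Long] =
    parts.map(p => p.count(_ % 2 == 0).toLong)

  // Steps 2 and 3: the partial counts are brought together in one place
  // (the Exchange SinglePartition) and summed into the final count.
  def finalCount(parts: Seq[Seq[Long]]): Long =
    partialCounts(parts).sum

  def main(args: Array[String]): Unit = {
    println(partialCounts(partitions))
    println(finalCount(partitions)) // prints 500
  }
}
```

Only one small number per partition has to cross the partition boundary, which is why the shuffle for a count is cheap even though it still counts as a wide transformation.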
