Scala Spark: sum all columns over all rows

Time: 2020-02-28 17:25:10

Tags: scala apache-spark

I can do this easily enough with

df.groupBy().sum()

But I'm not sure whether the empty groupBy() adds a performance hit, or is simply bad style. I've also seen it done as

df.agg( ("col1", "sum"), ("col2", "sum"), ("col3", "sum"))

which skips the (I think unnecessary) groupBy, but has its own ugliness. What is the right way to do this? Is there any under-the-hood difference between using .groupBy().<aggOp>() and using .agg?

1 Answer:

Answer 0 (score: 2)

If you check the Physical plan, both queries trigger the same plan internally, so we can use either of them!

I think using df.groupBy().sum() is convenient, since we don't need to specify all the column names.

Example:

val df = Seq((1, 2, 3), (4, 5, 6)).toDF("id", "j", "k")

scala> df.groupBy().sum().explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
      +- LocalTableScan [id#7, j#8, k#9]

scala> df.agg(sum("id"),sum("j"),sum("k")).explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
      +- LocalTableScan [id#7, j#8, k#9]
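
If you prefer the .agg form but still want to avoid hard-coding the column names, you can build the aggregation expressions from df.columns. A minimal sketch, assuming all columns are numeric, using sum from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.sum

// Build one sum() expression per column, so no column name is hard-coded
val sums = df.columns.map(c => sum(c))

// agg takes a head expression plus varargs; this yields the same single-row
// result (sum(id), sum(j), sum(k)) and the same physical plan as above
df.agg(sums.head, sums.tail: _*)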