Do all aggregations in a single groupBy or separately?

Asked: 2018-01-25 13:30:02

Tags: performance apache-spark dataframe pyspark apache-spark-sql

I need to run a fairly large number of aggregations (around 9-10) on a big dataset in my PySpark code. I can approach this in two ways:

Single groupBy:

df.groupBy(col1, col2).agg({"col3":"sum", "col4":"avg", "col5":"min", "col6":"sum", "col7":"max", "col8":"avg", "col9":"sum"})

Group and join:

temp1 = df.groupBy(col1, col2).agg({"col3":"sum"})
temp2 = df.groupBy(col1, col2).agg({"col4":"avg"})
temp3 = df.groupBy(col1, col2).agg({"col5":"min"})
.
.
.
temp9 = df.groupBy(col1, col2).agg({"col9":"sum"})

Then join all nine of these dataframes to get the final output.
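
For concreteness, the join step of this second approach would look roughly like the sketch below (col1 and col2 are assumed to hold the grouping column names; only three frames are listed, the rest would be appended the same way):

from functools import reduce

# Sketch of the second approach: join the per-aggregation frames back
# together on the grouping keys. Shown only for comparison, not recommended.
temps = [temp1, temp2, temp3]  # ... plus the remaining aggregated frames
result = reduce(lambda left, right: left.join(right, [col1, col2]), temps)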

Which one is more efficient?

1 answer:

Answer 0 (score: 5)

TL;DR: go with the first one.

It is not even a contest. Readability alone should be enough to reject the second solution, which is verbose and convoluted.
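
For readability it can also help to write the single-groupBy version with explicit functions from pyspark.sql.functions and aliases instead of the dict form; a minimal sketch, assuming the columns are literally named col1 through col9:

from pyspark.sql import functions as F

# One groupBy computing all aggregates, with readable output column names.
result = df.groupBy("col1", "col2").agg(
    F.sum("col3").alias("col3_sum"),
    F.avg("col4").alias("col4_avg"),
    F.min("col5").alias("col5_min"),
    F.sum("col6").alias("col6_sum"),
    F.max("col7").alias("col7_max"),
    F.avg("col8").alias("col8_avg"),
    F.sum("col9").alias("col9_sum"),
)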

Not to mention the execution plan, which is simply a monster (and this is with only three of the aggregated tables!):

== Physical Plan ==
*Project [col1#512L, col2#513L, sum(col3)#597L, avg(col4)#614, min(col5)#631L]
+- *SortMergeJoin [col1#512L, col2#513L], [col1#719L, col2#720L], Inner
   :- *Project [col1#512L, col2#513L, sum(col3)#597L, avg(col4)#614]
   :  +- *SortMergeJoin [col1#512L, col2#513L], [col1#704L, col2#705L], Inner
   :     :- *Sort [col1#512L ASC NULLS FIRST, col2#513L ASC NULLS FIRST], false, 0
   :     :  +- *HashAggregate(keys=[col1#512L, col2#513L], functions=[sum(col3#514L)])
   :     :     +- Exchange hashpartitioning(col1#512L, col2#513L, 200)
   :     :        +- *HashAggregate(keys=[col1#512L, col2#513L], functions=[partial_sum(col3#514L)])
   :     :           +- *Project [_1#491L AS col1#512L, _2#492L AS col2#513L, _3#493L AS col3#514L]
   :     :              +- *Filter (isnotnull(_1#491L) && isnotnull(_2#492L))
   :     :                 +- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
   :     +- *Sort [col1#704L ASC NULLS FIRST, col2#705L ASC NULLS FIRST], false, 0
   :        +- *HashAggregate(keys=[col1#704L, col2#705L], functions=[avg(col4#707L)])
   :           +- Exchange hashpartitioning(col1#704L, col2#705L, 200)
   :              +- *HashAggregate(keys=[col1#704L, col2#705L], functions=[partial_avg(col4#707L)])
   :                 +- *Project [_1#491L AS col1#704L, _2#492L AS col2#705L, _4#494L AS col4#707L]
   :                    +- *Filter (isnotnull(_2#492L) && isnotnull(_1#491L))
   :                       +- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
   +- *Sort [col1#719L ASC NULLS FIRST, col2#720L ASC NULLS FIRST], false, 0
      +- *HashAggregate(keys=[col1#719L, col2#720L], functions=[min(col5#723L)])
         +- Exchange hashpartitioning(col1#719L, col2#720L, 200)
            +- *HashAggregate(keys=[col1#719L, col2#720L], functions=[partial_min(col5#723L)])
               +- *Project [_1#491L AS col1#719L, _2#492L AS col2#720L, _5#495L AS col5#723L]
                  +- *Filter (isnotnull(_1#491L) && isnotnull(_2#492L))
                     +- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
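
A plan of this shape can be reproduced by building the separate aggregates, joining them, and calling explain() on the result; a minimal sketch, assuming the columns are literally named col1 through col5:

# Sketch: three per-column aggregates joined back on the grouping keys.
temp1 = df.groupBy("col1", "col2").agg({"col3": "sum"})
temp2 = df.groupBy("col1", "col2").agg({"col4": "avg"})
temp3 = df.groupBy("col1", "col2").agg({"col5": "min"})
joined = temp1.join(temp2, ["col1", "col2"]).join(temp3, ["col1", "col2"])
joined.explain()  # prints a plan with repeated scans, shuffles, sorts and joins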

Compare that with the plain aggregation (over all the columns):

== Physical Plan ==
*HashAggregate(keys=[col1#512L, col2#513L], functions=[max(col7#518L), avg(col8#519L), sum(col3#514L), sum(col6#517L), sum(col9#520L), min(col5#516L), avg(col4#515L)])
+- Exchange hashpartitioning(col1#512L, col2#513L, 200)
   +- *HashAggregate(keys=[col1#512L, col2#513L], functions=[partial_max(col7#518L), partial_avg(col8#519L), partial_sum(col3#514L), partial_sum(col6#517L), partial_sum(col9#520L), partial_min(col5#516L), partial_avg(col4#515L)])
      +- *Project [_1#491L AS col1#512L, _2#492L AS col2#513L, _3#493L AS col3#514L, _4#494L AS col4#515L, _5#495L AS col5#516L, _6#496L AS col6#517L, _7#497L AS col7#518L, _8#498L AS col8#519L, _9#499L AS col9#520L]
         +- Scan ExistingRDD[_1#491L,_2#492L,_3#493L,_4#494L,_5#495L,_6#496L,_7#497L,_8#498L,_9#499L,_10#500L]
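
The difference is visible directly in the plans: the single aggregation scans the input once, shuffles it once (a single Exchange), and computes all the aggregates in one partial/final HashAggregate pair, whereas the group-and-join version scans and shuffles the data once per aggregate and then pays for additional sorts and SortMergeJoins on top. The compact plan above can be reproduced with a one-liner; again a sketch, assuming literal column names:

# Sketch: single groupBy over all aggregates, matching the compact plan above.
df.groupBy("col1", "col2").agg(
    {"col3": "sum", "col4": "avg", "col5": "min", "col6": "sum",
     "col7": "max", "col8": "avg", "col9": "sum"}
).explain()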