Spark 1.6 DataFrame: optimizing join partitioning

Date: 2018-09-05 10:38:45

Tags: apache-spark dataframe apache-spark-sql rdd

I have a question about Spark DataFrame partitioning. I am currently using Spark 1.6 because of project requirements. Here is an excerpt of my code:

sqlContext.getConf("spark.sql.shuffle.partitions") // 6

val df = sc.parallelize(List(("A",1),("A",4),("A",2),("B",5),("C",2),("D",2),("E",2),("B",7),("C",9),("D",1))).toDF("id_1","val_1")
df.rdd.getNumPartitions // 4

val df2 = sc.parallelize(List(("B",1),("E",4),("H",2),("J",5),("C",2),("D",2),("F",2))).toDF("id_2","val_2")
df2.rdd.getNumPartitions // 4

val df3 = df.join(df2,$"id_1" === $"id_2")
df3.rdd.getNumPartitions // 6

val df4 = df3.repartition(3,$"id_1")
df4.rdd.getNumPartitions // 3

df4.explain(true)

Here is the explain plan that was generated:

== Parsed Logical Plan ==
'RepartitionByExpression ['id_1], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
   :- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
   :  +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
   +- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
      +- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26

== Analyzed Logical Plan ==
id_1: string, val_1: int, id_2: string, val_2: int
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
   :- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
   :  +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
   +- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
      +- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26

== Optimized Logical Plan ==
RepartitionByExpression [id_1#42], Some(3)
+- Join Inner, Some((id_1#42 = id_2#46))
   :- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
   :  +- LogicalRDD [_1#40,_2#41], MapPartitionsRDD[169] at rddToDataFrameHolder at <console>:26
   +- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
      +- LogicalRDD [_1#44,_2#45], MapPartitionsRDD[173] at rddToDataFrameHolder at <console>:26

== Physical Plan ==
TungstenExchange hashpartitioning(id_1#42,3), None
+- SortMergeJoin [id_1#42], [id_2#46]
   :- Sort [id_1#42 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(id_1#42,6), None
   :     +- Project [_1#40 AS id_1#42,_2#41 AS val_1#43]
   :        +- Scan ExistingRDD[_1#40,_2#41] 
   +- Sort [id_2#46 ASC], false, 0
      +- TungstenExchange hashpartitioning(id_2#46,6), None
         +- Project [_1#44 AS id_2#46,_2#45 AS val_2#47]
            +- Scan ExistingRDD[_1#44,_2#45]

As far as I know, a DataFrame is an abstract interface on top of RDDs, so partitioning should be delegated to the Catalyst optimizer.

Unlike RDDs, where many transformations accept a number-of-partitions parameter so that co-partitioning and co-location can be optimized wherever possible, the only way to change a DataFrame's partitioning explicitly is to call the repartition method; otherwise, the number of partitions used for joins and aggregations is taken from the configuration parameter spark.sql.shuffle.partitions.
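
For example, a minimal sketch of that behaviour, assuming the same sqlContext, df and df2 defined above (the name df3b is only illustrative): setting spark.sql.shuffle.partitions to the target value before the join makes the join shuffle itself produce that many partitions, so no separate repartition call is needed afterwards.

// Set the target number of shuffle partitions before running the join,
// so the join's own exchange already uses 3 partitions.
sqlContext.setConf("spark.sql.shuffle.partitions", "3")

val df3b = df.join(df2, $"id_1" === $"id_2")
df3b.rdd.getNumPartitions // 3 -- the join output is already hash-partitioned by the join key into 3 partitions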

From the explain plan above I can see what appears to be a useless repartition (actually a shuffle) to 6 partitions (the default), followed by yet another repartition to the final value imposed by the repartition method.

I would have expected the optimizer to change the number of partitions of the join directly to the final value of 3.

Can someone help me shed some light on this? Maybe I'm missing something.

1 Answer:

Answer 0 (score: 0)

If you use Spark SQL, the number of shuffle partitions is always equal to spark.sql.shuffle.partitions. But if spark.sql.adaptive.enabled is turned on, an ExchangeCoordinator is added. The job of this coordinator is to determine the post-shuffle partition count for a stage that needs to fetch shuffle data from one or more other stages.
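
As a rough sketch of what that looks like in practice (adaptive execution is experimental in Spark 1.6, and the property names below are assumptions taken from its configuration, so verify them against your distribution):

// Enable adaptive execution so an ExchangeCoordinator is inserted; it decides the
// post-shuffle partition count from the actual shuffle data size at runtime
// instead of always using spark.sql.shuffle.partitions.
sqlContext.setConf("spark.sql.adaptive.enabled", "true")
// Target size per post-shuffle partition in bytes (64 MB here, assumed default);
// the coordinator merges small shuffle partitions until they roughly reach this size.
sqlContext.setConf("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")

val adaptiveJoin = df.join(df2, $"id_1" === $"id_2")
adaptiveJoin.rdd.getNumPartitions // decided at runtime by the coordinator, not fixed at 6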