A fast way to join three DataFrames in Spark SQL

Time: 2018-08-02 13:00:54

Tags: scala apache-spark apache-spark-sql

PARENT_DATA_FRAME:

+------------+------------+------------+------------+------------+
|key_col_0   |key_col_1   |key_col_2   |key_col_3   |val_0       |
+------------+------------+------------+------------+------------+
|key000000   |key000001   |key000002   |key000003   |val_0       |
|key000010   |key000011   |key000012   |key000013   |val_1       |
|key000020   |key000021   |key000022   |key000023   |val_2       |
|key000030   |key000031   |key000032   |key000033   |val_3       |
|key000040   |key000041   |key000042   |key000043   |val_4       |
+------------+------------+------------+------------+------------+

CHILD_A_DATA_FRAME:

+------------+------------+------------+------------+------------+
|key_col_0   |key_col_1   |key_col_2   |key_col_3   |val_0       |
+------------+------------+------------+------------+------------+
|key000000   |key000001   |key000002   |key000003   |val_0       |
|key000010   |key000011   |key000012   |key000013   |val_1       |
+------------+------------+------------+------------+------------+

CHILD_B_DATA_FRAME:

+------------+------------+------------+------------+------------+
|key_col_0   |key_col_1   |key_col_2   |key_col_3   |val_0       |
+------------+------------+------------+------------+------------+
|key000000   |key000001   |key000002   |key000003   |val_0       |
|key000020   |key000021   |key000022   |key000023   |val_2       |
+------------+------------+------------+------------+------------+

EXPECTED_RESULT:

+------------+------------+------------+------------+------------+----------------------------------------------------------+----------------------------------------------------------+
|key_col_0   |key_col_1   |key_col_2   |key_col_3   |val_0       |A_CHILD                                                   |B_CHILD                                                   |
+------------+------------+------------+------------+------------+----------------------------------------------------------+----------------------------------------------------------+
|key000000   |key000001   |key000002   |key000003   |val_0       |array([key000000,key000001,key000002,key000003,val_0])    |array([key000000,key000001,key000002,key000003,val_0])    |
|key000010   |key000011   |key000012   |key000013   |val_1       |array([key000010,key000011,key000012,key000013,val_1])    |array()                                                   |
|key000020   |key000021   |key000022   |key000023   |val_2       |array()                                                   |array([key000020,key000021,key000022,key000023,val_2])    |
|key000030   |key000031   |key000032   |key000033   |val_3       |array()                                                   |array()                                                   |
|key000040   |key000041   |key000042   |key000043   |val_4       |array()                                                   |array()                                                   |
+------------+------------+------------+------------+------------+----------------------------------------------------------+----------------------------------------------------------+

I want to join the three DataFrames PARENT, A_CHILD and B_CHILD into a single DataFrame, as in the EXPECTED_RESULT example above. I found a solution, but it is slow:

import org.apache.spark.sql.functions.{col, collect_list, struct}

val parentDF = ...
val childADF = ...
val childBDF = ...

val aggregatedAColName = "CHILD_A"
val aggregatedBColName = "CHILD_B"

// Columns to nest into each child struct, and the join keys.
val columns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3", "val_0")
val keyColumns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3")

// Wrap each child A row in a struct, then collect the structs into an array per key.
val nestedAColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedAColName)
val childADataFrame = childADF
  .select(nestedAColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedAColName).alias(aggregatedAColName))
val joinedWithA = parentDF.join(childADataFrame, keyColumns, "left")

// Same for child B, then left-join onto the previous result.
val nestedBColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedBColName)
val childBDataFrame = childBDF
  .select(nestedBColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedBColName).alias(aggregatedBColName))
val joinedWithB = joinedWithA.join(childBDataFrame, keyColumns, "left")

How can I make this faster?

1 Answer:

Answer 0 (score: -1)

We can convert these DataFrames to RDDs and then to pair RDDs. Then we can use leftOuterJoin twice. We will end up with values of the following shape:

((key000000,key000001,key000002,key000003,val_0),(1,Some(1),Some(1)))
((key000010,key000011,key000012,key000013,val_1),(1,Some(1),None))

and so on. These can then be mapped into the desired form. Hope this helps.
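
A minimal sketch of this pair-RDD approach, reusing parentDF, childADF and childBDF from the question; the toPairRDD helper and the marker value 1 are illustrative assumptions, not part of the answer:

// Key each row by the tuple of all five columns, matching the sample values above.
def toPairRDD(df: org.apache.spark.sql.DataFrame) =
  df.rdd.map { r =>
    ((r.getString(0), r.getString(1), r.getString(2), r.getString(3), r.getString(4)), 1)
  }

// Two leftOuterJoins yield the nested type (K, ((Int, Option[Int]), Option[Int])),
// so a final map flattens it into the (1, Some(1), None) shape shown above.
val joined = toPairRDD(parentDF)
  .leftOuterJoin(toPairRDD(childADF))
  .leftOuterJoin(toPairRDD(childBDF))
  .map { case (key, ((p, a), b)) => (key, (p, a, b)) }

joined.collect().foreach(println)
// e.g. ((key000010,key000011,key000012,key000013,val_1),(1,Some(1),None))

Note that to reproduce EXPECTED_RESULT exactly, the child sides would still need to carry the matched rows (collected into arrays) rather than the 1 marker used in this sketch.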