PARENT_DATA_FRAME:
+------------+------------+------------+------------+------------+
|key_col_0 |key_col_1 |key_col_2 |key_col_3 |val_0 |
+------------+------------+------------+------------+------------+
|key000000 |key000001 |key000002 |key000003 |val_0 |
|key000010 |key000011 |key000012 |key000013 |val_1 |
|key000020 |key000021 |key000022 |key000023 |val_2 |
|key000030 |key000031 |key000032 |key000033 |val_3 |
|key000040 |key000041 |key000042 |key000043 |val_4 |
+------------+------------+------------+------------+------------+
CHILD_A_DATA_FRAME:
+------------+------------+------------+------------+------------+
|key_col_0 |key_col_1 |key_col_2 |key_col_3 |val_0 |
+------------+------------+------------+------------+------------+
|key000000 |key000001 |key000002 |key000003 |val_0 |
|key000010 |key000011 |key000012 |key000013 |val_1 |
+------------+------------+------------+------------+------------+
CHILD_B_DATA_FRAME:
+------------+------------+------------+------------+------------+
|key_col_0 |key_col_1 |key_col_2 |key_col_3 |val_0 |
+------------+------------+------------+------------+------------+
|key000000 |key000001 |key000002 |key000003 |val_0 |
|key000020 |key000021 |key000022 |key000023 |val_2 |
+------------+------------+------------+------------+------------+
EXPECTED_RESULT:
+------------+------------+------------+------------+------------+----------------------------------------------------------+----------------------------------------------------------+
|key_col_0 |key_col_1 |key_col_2 |key_col_3 |val_0 |A_CHILD |B_CHILD |
+------------+------------+------------+------------+------------+----------------------------------------------------------+----------------------------------------------------------+
|key000000 |key000001 |key000002 |key000003 |val_0 |array([key000000,key000001,key000002,key000003,val_0]) |array([key000000,key000001,key000002,key000003,val_0]) |
|key000010 |key000011 |key000012 |key000013 |val_1 |array([key000010,key000011,key000012,key000013,val_1]) |array() |
|key000020 |key000021 |key000022 |key000023 |val_2 |array() |array([key000020,key000021,key000022,key000023,val_2]) |
|key000030 |key000031 |key000032 |key000033 |val_3 |array() |array() |
|key000040 |key000041 |key000042 |key000043 |val_4 |array() |array() |
+------------+------------+------------+------------+------------+----------------------------------------------------------+----------------------------------------------------------+
I want to join the three DataFrames (PARENT, A_CHILD and B_CHILD) into a single DataFrame, as shown in EXPECTED_RESULT above. I found a solution, but it is slow:
import org.apache.spark.sql.functions.{col, collect_list, struct}

val parentDF = ...
val childADF = ...
val childBDF = ...
val aggregatedAColName = "CHILD_A"
val aggregatedBColName = "CHILD_B"
val columns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3", "val_0")
val keyColumns = List("key_col_0", "key_col_1", "key_col_2", "key_col_3")

// Pack every CHILD_A row into a struct, then collect the structs per key.
val nestedAColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedAColName)
val childADataFrame = childADF
  .select(nestedAColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedAColName).alias(aggregatedAColName))
// Left join: parent rows without a CHILD_A match get null here, not an empty array.
val joinedWithA = parentDF.join(childADataFrame, keyColumns, "left")

// Same steps for CHILD_B, joined onto the previous result.
val nestedBColumns = keyColumns.map(x => col(x)) :+ struct(columns.map(col): _*).alias(aggregatedBColName)
val childBDataFrame = childBDF
  .select(nestedBColumns: _*)
  .repartition(keyColumns.map(col): _*)
  .groupBy(keyColumns.map(col): _*)
  .agg(collect_list(aggregatedBColName).alias(aggregatedBColName))
val joinedWithB = joinedWithA.join(childBDataFrame, keyColumns, "left")
How can I do this faster?
Answer 0 (score: -1)
We can convert these DataFrames to RDDs and then to pair RDDs. After that we can call leftOuterJoin twice, which gives values of the following form:
((key000000,key000001,key000002,key000003,val_0),(1,Some(1),Some(1)))
((key000010,key000011,key000012,key000013,val_1),(1,Some(1),None))
and so on. These can then be mapped into the required form. Hope this helps.
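The pair RDD idea above could be sketched roughly as follows. This is only one possible reading of the answer, not a tested replacement for the DataFrame version: the object name PairRddNesting, the Key type and the helpers keyed and nestChildren are hypothetical, and the sketch assumes the columns appear in the order key_col_0 to key_col_3 followed by val_0, with string-typed keys. Unlike the sample values shown in the answer, it keys only on the four key columns and groups the child rows first, so that each parent row ends up with a sequence of matching child rows rather than a presence flag. It still shuffles for groupByKey and for both joins, so whether it is actually faster would have to be measured.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object PairRddNesting {
  // Hypothetical key type: the four key_col_* values of a row.
  type Key = (String, String, String, String)

  // Turn a DataFrame into a pair RDD keyed by its first four columns.
  // Assumption: column order is key_col_0..key_col_3, val_0 and the keys are strings.
  def keyed(df: DataFrame): RDD[(Key, Row)] =
    df.rdd.map(r => ((r.getString(0), r.getString(1), r.getString(2), r.getString(3)), r))

  // Left-outer-join the parent with both grouped children.
  // A missing match (None) is mapped to an empty sequence.
  def nestChildren(parent: DataFrame,
                   childA: DataFrame,
                   childB: DataFrame): RDD[(Row, Seq[Row], Seq[Row])] =
    keyed(parent)
      .leftOuterJoin(keyed(childA).groupByKey())
      .leftOuterJoin(keyed(childB).groupByKey())
      .map { case (_, ((p, a), b)) =>
        (p, a.map(_.toSeq).getOrElse(Seq.empty[Row]), b.map(_.toSeq).getOrElse(Seq.empty[Row]))
      }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("pair-rdd-nesting").getOrCreate()
    import spark.implicits._

    val cols = Seq("key_col_0", "key_col_1", "key_col_2", "key_col_3", "val_0")
    // A subset of the sample rows from the question.
    val parentDF = Seq(
      ("key000000", "key000001", "key000002", "key000003", "val_0"),
      ("key000010", "key000011", "key000012", "key000013", "val_1"),
      ("key000020", "key000021", "key000022", "key000023", "val_2")
    ).toDF(cols: _*)
    val childADF = parentDF.filter($"val_0".isin("val_0", "val_1"))
    val childBDF = parentDF.filter($"val_0".isin("val_0", "val_2"))

    // Each element is (parent row, matching CHILD_A rows, matching CHILD_B rows).
    nestChildren(parentDF, childADF, childBDF).collect().foreach(println)
    spark.stop()
  }
}
```

Converting the result back to a DataFrame would need an explicit schema for the two nested array columns, which is left out here.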