Merge the results of one Scala Spark DataFrame into an array column of another DataFrame

Asked: 2017-10-05 19:51:35

Tags: scala apache-spark dataframe apache-spark-sql

Is there a way to take the following two DataFrames, join them on the col0 field, and produce the output below?

// dataframe1

val df1 = Seq(
  (1, 9, 100.1, 10)
).toDF("pk", "col0", "col1", "col2")

// dataframe2

val df2 = Seq(
  (1, 9, "a1", "b1"),
  (2, 9, "a2", "b2")
).toDF("pk", "col0", "str_col1", "str_col2")

// expected resulting dataframe

+---+-----+----+--------------------------------+
| pk| col1|col2|new_arr_col                     |
+---+-----+----+--------------------------------+
|  1|100.1|  10|[[1, 9, a1, b1], [2, 9, a2, b2]]|
+---+-----+----+--------------------------------+

1 Answer:

Answer 0 (score: 1)
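
The idea: pack all of df2's columns into a single array column, join that to df1 on col0, then group by every df1 column and collect the matching arrays into one array of arrays.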

import org.apache.spark.sql.functions._
import spark.implicits._

// creating new array column out of all df2 columns:
val df2AsArray = df2.select($"col0", array(df2.columns.map(col): _*) as "new_arr_col")

val result = df1.join(df2AsArray, "col0")
  .groupBy(df1.columns.map(col): _*) // grouping by all df1 columns
  .agg(collect_list("new_arr_col") as "new_arr_col") // collecting array of arrays
  .drop("col0")

result.show(false)
// +---+-----+----+--------------------------------------------------------+
// |pk |col1 |col2|new_arr_col                                             |
// +---+-----+----+--------------------------------------------------------+
// |1  |100.1|10  |[WrappedArray(2, 9, a2, b2), WrappedArray(1, 9, a1, b1)]|
// +---+-----+----+--------------------------------------------------------+
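
Two caveats. First, collect_list makes no ordering guarantee, which is why the collected arrays above come back in a different order than they appear in df2. Second, array(...) coerces all of df2's columns to a single common type (here, string). If you want each df2 row to keep its column names and types, here is a minimal sketch of a struct-based variant under the same join/group logic (df2AsStructs and resultStructs are illustrative names, not part of the original answer):

// sketch: collect typed structs instead of stringly-typed arrays
val df2AsStructs = df2.select($"col0", struct(df2.columns.map(col): _*) as "new_struct_col")

val resultStructs = df1.join(df2AsStructs, "col0")
  .groupBy(df1.columns.map(col): _*)      // grouping by all df1 columns, as above
  .agg(collect_list("new_struct_col") as "new_arr_col") // array of structs, not array of arrays
  .drop("col0")

Each element of new_arr_col is then a struct with fields pk, col0, str_col1, and str_col2, addressable by name in later expressions.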