Question

我有一个类似于this的问题，但collect_list要操作的列数由名称列表给出。例如：

scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
|  A|   D1|  T0|   P1|
|  A|   D0|  T1|   P2|
|  B|   Y1|  T0|   P3|
|  B|   Y2|  T2|   P3|
|  C|   H1|  T0|   P5|
|  C|   H0|  T9|   P5|
|  B|   Y0|  T1|   P2|
|  B|   H1|  T3|   P6|
|  D|   H1|  T2|   P4|
+---+-----+----+-----+


scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)

scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]

scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
|  B|   [Y1, Y2, Y0, H1]|  [T0, T2, T1, T3]|   [P3, P3, P2, P6]|
|  D|               [H1]|              [T2]|               [P4]|
|  C|           [H1, H0]|          [T0, T9]|           [P5, P5]|
|  A|           [D1, D0]|          [T0, T1]|           [P1, P2]|
+---+-------------------+------------------+-------------------+

有没有办法可以在不知道combList之前的元素数量的情况下将collect_list应用于agg内的多个列？

Answer 1

您可以使用collect_list（struct（col1，col2））AS元素。

示例：

df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
val outputDf = spark.sql(s"SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema

df
 |-- cd_issuer: string (nullable = true)
 |-- cd_doc: string (nullable = true)
 |-- cd_item: string (nullable = true)
 |-- nm_item: string (nullable = true)

outputDf
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- item: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- cd_item: string (nullable = true)
|    |    |-- nm_item: string (nullable = true)

Spark GroupBy agg collect_list多列

1 个答案: