Question

I have the following table:

+-------+---------+---------+
|movieId|movieName|    genre|
+-------+---------+---------+
|      1| example1|   action|
|      1| example1| thriller|
|      1| example1|  romance|
|      2| example2|fantastic|
|      2| example2|   action|
+-------+---------+---------+

What I want is to concatenate the genre values of all rows that share the same id and name, like this:

+-------+---------+-----------------------+
|movieId|movieName|                  genre|
+-------+---------+-----------------------+
|      1| example1|action|thriller|romance|
|      2| example2|       action|fantastic|
+-------+---------+-----------------------+

Answer 1

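Use groupBy with collect_list to gather the genres of all rows that share the same movie id and name into a list. Then use concat_ws to join that list into a single string. collect_list does not guarantee the order of the collected elements, so if order matters, apply sort_array first. A small example with the sample DataFrame: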

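import org.apache.spark.sql.functions.{collect_list, concat_ws, sort_array}
import spark.implicits._  // enables the $"col" syntax; assumes a SparkSession named spark

val df2 = df.groupBy("movieId", "movieName")
  .agg(collect_list($"genre").as("genre"))                    // gather the genres of each movie
  .withColumn("genre", concat_ws("|", sort_array($"genre")))  // sort, then join with "|"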

This gives the result:

+-------+---------+-----------------------+
|movieId|movieName|genre                  |
+-------+---------+-----------------------+
|1      |example1 |action|romance|thriller|
|2      |example2 |action|fantastic       |
+-------+---------+-----------------------+
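
Since the question asks about Spark SQL, the same aggregation can probably also be written as a SQL query against a temporary view. The following is only a sketch under a few assumptions: the object name GenreConcatSql, the helper concatGenres, and the view name "movies" are made up for illustration, and the session runs in local mode. collect_list, sort_array, and concat_ws are the same built-in functions used above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GenreConcatSql {

  // Hypothetical helper: registers `df` under the assumed view name "movies"
  // and concatenates the sorted genres of each (movieId, movieName) group.
  def concatGenres(spark: SparkSession, df: DataFrame): DataFrame = {
    df.createOrReplaceTempView("movies")
    spark.sql(
      """
      SELECT movieId,
             movieName,
             concat_ws('|', sort_array(collect_list(genre))) AS genre
      FROM movies
      GROUP BY movieId, movieName
      """)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; a real job would configure this differently.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("genre-concat")
      .getOrCreate()
    import spark.implicits._

    // The sample data from the question.
    val df = Seq(
      (1, "example1", "action"),
      (1, "example1", "thriller"),
      (1, "example1", "romance"),
      (2, "example2", "fantastic"),
      (2, "example2", "action")
    ).toDF("movieId", "movieName", "genre")

    concatGenres(spark, df).orderBy("movieId").show(false)

    spark.stop()
  }
}
```

Because sort_array sorts in ascending order by default, the genres of each movie come out alphabetically. If the sort_array call is dropped, the order within each string is whatever collect_list happens to produce.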
