Question

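I have a Spark DataFrame like the one below. I need to aggregate all rows that share the same id into a single row, but the tokens in the resulting value should be distinct.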

id|values
1 |hello
1 |hello Sam
1 |hello Tom
2 |hello
2 |hello Tom

Expected output:

id|values
1 |hello, Sam, Tom
2 |hello, Tom

I have the aggregation part working, but how do I filter out the duplicate tokens?

Current code:

df.select("id","values")
  .groupBy("id")
  .agg(concat_ws(",", collect_list("values")))

Second part of the question: I also tried this through SQL, but the result still contains duplicates.

spark.sql("select id, concat_ws(' ' ,collect_set(values)) as values from data group by id ").show(false)
+---+----------------------------+
|id |values                      |
+---+----------------------------+
|1  |hello hello Sam hello Tom   |
|2  |hello hello Tom             |
+---+----------------------------+

How can I get rid of the duplicates in the query above?

Answer 1

You can use collect_set instead of collect_list:

df.select("id","values").groupBy("id").agg(concat_ws(",",collect_set("values")))

Update

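The above does not work if each value is itself a space-separated string, because collect_set only removes values that are identical as whole strings.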
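To see why, here is a minimal sketch in plain Scala collections (no Spark involved) using the values for id 1 from the question. The object name `TokenDedup` is only for illustration.

```scala
object TokenDedup {
  // The three values for id 1 in the question's sample data.
  val rows: Seq[String] = Seq("hello", "hello Sam", "hello Tom")

  // Deduplicating whole values: all three strings differ, so nothing is removed.
  val distinctRows: Seq[String] = rows.distinct

  // Deduplicating tokens: split each value on whitespace first, then take distinct.
  val distinctTokens: Seq[String] = rows.flatMap(_.split("\\s+")).distinct

  def main(args: Array[String]): Unit = {
    println(distinctRows.mkString(" "))    // hello hello Sam hello Tom
    println(distinctTokens.mkString(", ")) // hello, Sam, Tom
  }
}
```

The first line printed matches the duplicated output shown in the question; the second matches the expected output.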

You need to split each value on whitespace and use a udf to find the distinct tokens:

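import org.apache.spark.sql.functions.{collect_list, udf}
import spark.implicits._  // needed for the $"..." column syntax

// Split every collected string on commas or whitespace, then keep the distinct tokens
val tokenize = udf((value: Seq[String]) => {
  value.flatMap(_.split(",|\\s+")).map(_.trim).distinct
})

df.select("id", "values").groupBy("id").agg(collect_list("values").as("value"))
    .withColumn("value", tokenize($"value"))
    .show(false)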

Output:

+---+-----------------+
|id |value            |
+---+-----------------+
|1  |[hello, Sam, Tom]|
|2  |[hello, Tom]     |
+---+-----------------+
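If you are on Spark 2.4 or later, the same result can probably be reached without a udf by combining the built-in `split`, `flatten`, and `array_distinct` functions. The following is only a sketch under that assumption; the object name `DistinctTokensBuiltin`, the helper `distinctTokens`, and the local session settings are made up for illustration. Note that the token order follows whatever order `collect_list` returns, which Spark does not guarantee after a shuffle.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_distinct, col, collect_list, concat_ws, flatten, split}

object DistinctTokensBuiltin {
  // Hypothetical helper: one row per id, holding a comma-separated string of distinct tokens.
  def distinctTokens(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(
        concat_ws(", ",
          // split each value into tokens, merge the per-row arrays, drop repeated tokens
          array_distinct(flatten(collect_list(split(col("values"), "\\s+"))))
        ).as("values")
      )

  def main(args: Array[String]): Unit = {
    // Local session only for trying the sketch out.
    val spark = SparkSession.builder().master("local[*]").appName("distinct-tokens").getOrCreate()
    import spark.implicits._

    // Sample data from the question.
    val df = Seq(
      (1, "hello"), (1, "hello Sam"), (1, "hello Tom"),
      (2, "hello"), (2, "hello Tom")
    ).toDF("id", "values")

    distinctTokens(df).show(false)
    spark.stop()
  }
}
```

Compared with the udf, this keeps the whole aggregation inside Spark's own expressions, so the optimizer can see through it.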

Answer 2

For anyone looking for a 100% SQL solution, something like the following worked for me to generate a comma-separated string of the list I was after:

select  
  patient_id, 
  concat_ws(",", collect_set(distinct encounter_id)) enc_list, 
  count(distinct encounter_id) enc_count
from
  encounter 
group by 1;
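For the question's own `data` table, where the duplicates are tokens inside each value rather than whole values, a pure-SQL variant would also need to split the strings first. This is a rough sketch assuming Spark 2.4+ (for `flatten` and `array_distinct`); the object name `DistinctTokensSql` and the local session settings are hypothetical, and the column name is backtick-quoted only as a precaution because `values` is also a SQL keyword.

```scala
import org.apache.spark.sql.SparkSession

object DistinctTokensSql {
  // Split each value on spaces, merge the arrays per id, drop repeated tokens, join with ", ".
  val query: String =
    """select id,
      |       concat_ws(', ', array_distinct(flatten(collect_list(split(`values`, ' '))))) as `values`
      |from data
      |group by id""".stripMargin

  def main(args: Array[String]): Unit = {
    // Local session only for trying the sketch out.
    val spark = SparkSession.builder().master("local[*]").appName("distinct-tokens-sql").getOrCreate()
    import spark.implicits._

    // Register the question's sample data under the view name used in its query.
    Seq(
      (1, "hello"), (1, "hello Sam"), (1, "hello Tom"),
      (2, "hello"), (2, "hello Tom")
    ).toDF("id", "values").createOrReplaceTempView("data")

    spark.sql(query).show(false)
    spark.stop()
  }
}
```

As with any `collect_list`, the order of tokens in the result is not guaranteed.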

Aggregate multiple rows into a single row and column in Spark
