Groupby and aggregate distinct values into a string

Date: 2021-02-25 05:01:07

Tags: python string apache-spark pyspark group-by

I have a table like the following:

ID   start date     name        type
 1   2020/01/01   cheese,meat    A, B
 1   2020/01/01   cheese,fruit   A, C

The desired output should be:

ID    start date    name                  type
1     2020/01/01    cheese,meat,fruit     A,B,C

I tried using collect_list and collect_set, but neither works.
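Before reaching for Spark, the target logic can be sketched in plain Python (a minimal illustration with hypothetical hard-coded rows, not Spark code): per (ID, start date) group, split the comma-separated values, merge them, and drop duplicates while keeping first-seen order.

```python
# Plain-Python sketch of the desired aggregation (hypothetical sample rows).
rows = [
    (1, "2020/01/01", "cheese,meat", "A,B"),
    (1, "2020/01/01", "cheese,fruit", "A,C"),
]

merged = {}
for id_, date, name, type_ in rows:
    key = (id_, date)
    names, types = merged.setdefault(key, ([], []))
    names.extend(name.split(","))
    types.extend(type_.split(","))

# dict.fromkeys deduplicates while preserving first-seen order.
result = [
    (id_, date, ",".join(dict.fromkeys(names)), ",".join(dict.fromkeys(types)))
    for (id_, date), (names, types) in merged.items()
]
print(result)  # [(1, '2020/01/01', 'cheese,meat,fruit', 'A,B,C')]
```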

3 answers:

Answer 0 (score: 2):

You can split and explode the columns, then group by and aggregate with collect_set:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'name',
    F.explode(F.split('name', ','))
).withColumn(
    'type',
    F.explode(F.split('type', ','))
).groupBy(
    'ID', 'start date'
).agg(
    F.concat_ws(',', F.collect_set('name')).alias('name'),
    F.concat_ws(',', F.collect_set('type')).alias('type')
)

df2.show()
+---+----------+-----------------+-----+
| ID|start date|             name| type|
+---+----------+-----------------+-----+
|  1|2020/01/01|fruit,meat,cheese|C,B,A|
+---+----------+-----------------+-----+

Answer 1 (score: 1):

You can use array_distinct to remove the duplicates remaining after collect_set:

from pyspark.sql import functions as F

df1 = df.groupBy("ID", "start date").agg(
    F.concat_ws(",", F.collect_set("name")).alias("name"),
    F.concat_ws(",", F.collect_set("type")).alias("type"),
).select(
    "ID",
    "start date",
    F.array_join(F.array_distinct(F.split("name", ",")), ",").alias("name"),
    F.array_join(F.array_distinct(F.split("type", ",")), ",").alias("type")
)

df1.show()

# +---+----------+-----------------+-------+
# | ID|start date|             name|   type|
# +---+----------+-----------------+-------+
# |  1|2020/01/01|cheese,fruit,meat|A, C, B|
# +---+----------+-----------------+-------+

Another way to remove the duplicates, using regexp_replace:

df1 = df.groupBy("ID", "start date").agg(
    F.concat_ws(",", F.collect_set("name")).alias("name"),
    F.concat_ws(",", F.collect_set("type")).alias("type"),
).select(
    "ID",
    "start date",
    F.regexp_replace("name", r"\b(\w+)\b\s*,\s*(?=.*\1)", "").alias("name"),
    F.regexp_replace("type", r"\b(\w+)\b\s*,\s*(?=.*\1)", "").alias("type")
)
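As a quick sanity check outside Spark, the same pattern can be exercised with Python's `re` module (a sketch; the regex flavor is compatible for this pattern): a word followed by a comma is deleted whenever the `(?=.*\1)` lookahead finds the same word later in the string, so only the last occurrence of each token survives.

```python
import re

# Drop "word," whenever that word occurs again later in the string,
# keeping only the last occurrence of each token.
pattern = r"\b(\w+)\b\s*,\s*(?=.*\1)"

print(re.sub(pattern, "", "cheese,meat,cheese,fruit"))  # meat,cheese,fruit
print(re.sub(pattern, "", "A,B,A,C"))  # B,A,C
```

Note the backreference inside the lookahead is not bounded by `\b`, so a token is also treated as a duplicate when it appears later as a substring of another word (e.g. `meat` before `meatball`).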

Answer 2 (score: 1):

You can use the following (note this answer assumes underscore column names, e.g. start_date):

import pyspark.sql.functions as F

df2 = df.select(
    df.ID,
    df.start_date,
    F.split(df.name, ',').alias('name'),
    F.split(df.type, ',').alias('type')
).groupby('ID', 'start_date').agg(
    F.concat_ws(',', F.array_distinct(F.flatten(F.collect_list('name')))).alias('name'),
    F.concat_ws(',', F.array_distinct(F.flatten(F.collect_list('type')))).alias('type')
)

df2.show()

Result:

+---+----------+-----------------+-----+
| ID|start_date|             name| type|
+---+----------+-----------------+-----+
|  1|2020/01/01|cheese,meat,fruit|A,B,C|
+---+----------+-----------------+-----+
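The key step above is `array_distinct(flatten(collect_list(...)))`: flatten the collected arrays into one, then keep the first occurrence of each element. A rough plain-Python analogue of that chain:

```python
from itertools import chain

# Hypothetical collected lists, as collect_list would produce per group.
collected_names = [["cheese", "meat"], ["cheese", "fruit"]]

# flatten(...) -> one list; dict.fromkeys(...) -> dedupe, first-seen order.
flattened = list(chain.from_iterable(collected_names))
distinct = list(dict.fromkeys(flattened))
print(",".join(distinct))  # cheese,meat,fruit
```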