Groupby and aggregate distinct values into a string

Date: 2021-02-25 05:01:07

Tags: python string apache-spark pyspark group-by

I have a table like the following:

ID   start date     name        type
 1   2020/01/01   cheese,meat    A, B
 1   2020/01/01   cheese,fruit   A, C

The desired output should be:

ID    start date    name                  type
1     2020/01/01    cheese,meat,fruit     A,B,C

I tried using collect_list and collect_set, but neither works.
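Before reaching for Spark, the target logic can be sketched in plain Python (a minimal illustration with hypothetical hard-coded rows, not Spark code): per (ID, start date) group, split the comma-separated values, merge them, and drop duplicates while keeping first-seen order.

```python
# Plain-Python sketch of the desired aggregation (hypothetical sample rows).
rows = [
    (1, "2020/01/01", "cheese,meat", "A,B"),
    (1, "2020/01/01", "cheese,fruit", "A,C"),
]

merged = {}
for id_, date, name, type_ in rows:
    key = (id_, date)
    names, types = merged.setdefault(key, ([], []))
    names.extend(name.split(","))
    types.extend(type_.split(","))

# dict.fromkeys deduplicates while preserving first-seen order.
result = [
    (id_, date, ",".join(dict.fromkeys(names)), ",".join(dict.fromkeys(types)))
    for (id_, date), (names, types) in merged.items()
]
print(result)  # [(1, '2020/01/01', 'cheese,meat,fruit', 'A,B,C')]
```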

3 answers:

Answer 0 (score: 2):

You can split and explode the columns, then group by and aggregate with collect_set:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'name',
    F.explode(F.split('name', ','))
).withColumn(
    'type',
    F.explode(F.split('type', ','))
).groupBy(
    'ID', 'start date'
).agg(
    F.concat_ws(',', F.collect_set('name')).alias('name'),
    F.concat_ws(',', F.collect_set('type')).alias('type')
)

df2.show()
+---+----------+-----------------+-----+
| ID|start date|             name| type|
+---+----------+-----------------+-----+
|  1|2020/01/01|fruit,meat,cheese|C,B,A|
+---+----------+-----------------+-----+

Answer 1 (score: 1):

You can use array_distinct to remove the duplicates remaining after collect_set:

from pyspark.sql import functions as F

df1 = df.groupBy("ID", "start date").agg(
    F.concat_ws(",", F.collect_set("name")).alias("name"),
    F.concat_ws(",", F.collect_set("type")).alias("type"),
).select(
    "ID",
    "start date",
    F.array_join(F.array_distinct(F.split("name", ",")), ",").alias("name"),
    F.array_join(F.array_distinct(F.split("type", ",")), ",").alias("type")
)

df1.show()

# +---+----------+-----------------+-------+
# | ID|start date|             name|   type|
# +---+----------+-----------------+-------+
# |  1|2020/01/01|cheese,fruit,meat|A, C, B|
# +---+----------+-----------------+-------+

Another way to remove the duplicates, using regexp_replace:

df1 = df.groupBy("ID", "start date").agg(
    F.concat_ws(",", F.collect_set("name")).alias("name"),
    F.concat_ws(",", F.collect_set("type")).alias("type"),
).select(
    "ID",
    "start date",
    F.regexp_replace("name", r"\b(\w+)\b\s*,\s*(?=.*\1)", "").alias("name"),
    F.regexp_replace("type", r"\b(\w+)\b\s*,\s*(?=.*\1)", "").alias("type")
)
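As a quick sanity check outside Spark, the same pattern can be exercised with Python's `re` module (a sketch; the regex flavor is compatible for this pattern): a word followed by a comma is deleted whenever the `(?=.*\1)` lookahead finds the same word later in the string, so only the last occurrence of each token survives.

```python
import re

# Drop "word," whenever that word occurs again later in the string,
# keeping only the last occurrence of each token.
pattern = r"\b(\w+)\b\s*,\s*(?=.*\1)"

print(re.sub(pattern, "", "cheese,meat,cheese,fruit"))  # meat,cheese,fruit
print(re.sub(pattern, "", "A,B,A,C"))  # B,A,C
```

Note the backreference inside the lookahead is not bounded by `\b`, so a token is also treated as a duplicate when it appears later as a substring of another word (e.g. `meat` before `meatball`).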

Answer 2 (score: 1):

You can use the following (note this answer assumes underscore column names, e.g. start_date):

import pyspark.sql.functions as F

df2 = df.select(
    df.ID,
    df.start_date,
    F.split(df.name, ',').alias('name'),
    F.split(df.type, ',').alias('type')
).groupby('ID', 'start_date').agg(
    F.concat_ws(',', F.array_distinct(F.flatten(F.collect_list('name')))).alias('name'),
    F.concat_ws(',', F.array_distinct(F.flatten(F.collect_list('type')))).alias('type')
)

df2.show()

Result:

+---+----------+-----------------+-----+
| ID|start_date|             name| type|
+---+----------+-----------------+-----+
|  1|2020/01/01|cheese,meat,fruit|A,B,C|
+---+----------+-----------------+-----+
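The key step above is `array_distinct(flatten(collect_list(...)))`: flatten the collected arrays into one, then keep the first occurrence of each element. A rough plain-Python analogue of that chain:

```python
from itertools import chain

# Hypothetical collected lists, as collect_list would produce per group.
collected_names = [["cheese", "meat"], ["cheese", "fruit"]]

# flatten(...) -> one list; dict.fromkeys(...) -> dedupe, first-seen order.
flattened = list(chain.from_iterable(collected_names))
distinct = list(dict.fromkeys(flattened))
print(",".join(distinct))  # cheese,meat,fruit
```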