I have a table like this:
ID  start date  name          type
1   2020/01/01  cheese,meat   A, B
1   2020/01/01  cheese,fruit  A, C
The desired output is:
ID  start date  name               type
1   2020/01/01  cheese,meat,fruit  A,B,C
I tried using collect_list and collect_set, but neither worked.
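A likely reason collect_set alone falls short: it deduplicates whole cell values, and "cheese,meat" and "cheese,fruit" are different strings, so "cheese" survives twice. The plain-Python sketch below is only a model of that difference, not Spark code, and the helper names are made up for illustration.

```python
# Plain-Python model of the two rows in the question (not Spark code).
rows = [
    {"name": "cheese,meat", "type": "A, B"},
    {"name": "cheese,fruit", "type": "A, C"},
]

def collect_whole_values(rows, column):
    """Roughly what a set-based aggregate sees: each cell is one opaque string."""
    return {row[column] for row in rows}

def collect_split_values(rows, column):
    """Split each cell on commas first, then deduplicate the pieces."""
    return {part.strip() for row in rows for part in row[column].split(",")}

whole = collect_whole_values(rows, "name")
pieces = collect_split_values(rows, "name")
print(sorted(whole))   # ['cheese,fruit', 'cheese,meat']
print(sorted(pieces))  # ['cheese', 'fruit', 'meat']
```

The answers below all follow the second pattern in some form: split the strings into individual values before removing duplicates.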
Answer 0 (score: 2)
You can split and explode the columns, then group by ID and start date and aggregate with collect_set:
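import pyspark.sql.functions as F

# Split each comma-separated column and explode it to one value per row,
# then collect the distinct values per group and join them back together.
# Splitting on r',\s*' also drops the space after the comma in values like "A, B".
df2 = df.withColumn(
    'name',
    F.explode(F.split('name', r',\s*'))
).withColumn(
    'type',
    F.explode(F.split('type', r',\s*'))
).groupBy(
    'ID', 'start date'
).agg(
    F.concat_ws(',', F.collect_set('name')).alias('name'),
    F.concat_ws(',', F.collect_set('type')).alias('type')
)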
df2.show()
+---+----------+-----------------+-----+
| ID|start date|             name| type|
+---+----------+-----------------+-----+
|  1|2020/01/01|fruit,meat,cheese|C,B,A|
+---+----------+-----------------+-----+
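One thing worth knowing about this approach: exploding two columns in sequence produces every name/type combination for each input row, so the intermediate table grows before the grouping shrinks it again. The plain-Python sketch below is an illustration only, not Spark; exploded and grouped are hypothetical names, and the sample values are written without spaces after the commas for simplicity.

```python
from itertools import product

rows = [
    (1, "2020/01/01", "cheese,meat", "A,B"),
    (1, "2020/01/01", "cheese,fruit", "A,C"),
]

# Model of the two chained explodes: one output row per (name, type) pair.
exploded = [
    (row_id, date, name, type_)
    for row_id, date, names, types in rows
    for name, type_ in product(names.split(","), types.split(","))
]
print(len(exploded))  # 8

# Model of the grouping step: distinct values per (ID, start date) key.
grouped = {}
for row_id, date, name, type_ in exploded:
    names, types = grouped.setdefault((row_id, date), (set(), set()))
    names.add(name)
    types.add(type_)

print(sorted(grouped[(1, "2020/01/01")][0]))  # ['cheese', 'fruit', 'meat']
print(sorted(grouped[(1, "2020/01/01")][1]))  # ['A', 'B', 'C']
```

Because a set has no defined order, the joined string can come out in any order, which is consistent with the fruit,meat,cheese result shown above.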
Answer 1 (score: 1)
You can use array_distinct to remove the duplicates after collect_set:
from pyspark.sql import functions as F

# collect_set gathers the distinct whole strings per group; splitting the joined
# result and applying array_distinct then removes the repeated individual values.
df1 = df.groupBy("ID", "start date").agg(
    F.concat_ws(",", F.collect_set("name")).alias("name"),
    F.concat_ws(",", F.collect_set("type")).alias("type"),
).select(
    "ID",
    "start date",
    F.array_join(F.array_distinct(F.split("name", ",")), ",").alias("name"),
    F.array_join(F.array_distinct(F.split("type", ",")), ",").alias("type")
)
df1.show()
# +---+----------+-----------------+-------+
# | ID|start date|             name|   type|
# +---+----------+-----------------+-------+
# |  1|2020/01/01|cheese,fruit,meat|A, C, B|
# +---+----------+-----------------+-------+
Another way to remove the duplicates is regexp_replace:
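# The pattern deletes a value (and its trailing comma) when the same value
# appears again later in the string. The \b around \1 keeps a value such as
# "meat" from being treated as a duplicate of a longer one such as "meatball".
df1 = df.groupBy("ID", "start date").agg(
    F.concat_ws(",", F.collect_set("name")).alias("name"),
    F.concat_ws(",", F.collect_set("type")).alias("type"),
).select(
    "ID",
    "start date",
    F.regexp_replace("name", r"\b(\w+)\b\s*,\s*(?=.*\b\1\b)", "").alias("name"),
    F.regexp_replace("type", r"\b(\w+)\b\s*,\s*(?=.*\b\1\b)", "").alias("type")
)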
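To see what a pattern of this shape does, here is a small sketch using Python's re module rather than Spark. Spark evaluates Java regular expressions, so this assumes the two engines agree on these particular patterns; the dedupe helper and the PLAIN and BOUNDED names are hypothetical. It compares a plain backreference in the lookahead with one wrapped in word boundaries.

```python
import re

# Drop a token (and its trailing comma) when the same text appears later.
PLAIN = r"\b(\w+)\b\s*,\s*(?=.*\1)"
# Same idea, but the later occurrence must be a whole word.
BOUNDED = r"\b(\w+)\b\s*,\s*(?=.*\b\1\b)"

def dedupe(value, pattern=BOUNDED):
    """Remove earlier occurrences of any value that is repeated later on."""
    return re.sub(pattern, "", value)

print(dedupe("cheese,meat,cheese,fruit"))        # meat,cheese,fruit
print(dedupe("meat,meatball", pattern=PLAIN))    # meatball
print(dedupe("meat,meatball", pattern=BOUNDED))  # meat,meatball
```

Note that this technique keeps the last occurrence of each repeated value, so the surviving order can differ from the first-seen order.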
Answer 2 (score: 1)
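You can split each column into an array, collect the arrays per group, then flatten them and remove the duplicates with array_distinct: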
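import pyspark.sql.functions as F

# Assumes the date column is named start_date (with an underscore), as in the result below.
# Splitting on r',\s*' also drops the space after the comma in values like "A, B".
df.select(
    df.ID,
    df.start_date,
    F.split(df.name, r',\s*').alias('name'),
    F.split(df.type, r',\s*').alias('type')
).groupby('ID', 'start_date').agg(
    F.concat_ws(',', F.array_distinct(F.flatten(F.collect_list('name')))).alias('name'),
    F.concat_ws(',', F.array_distinct(F.flatten(F.collect_list('type')))).alias('type')
)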
Result:
+---+----------+-----------------+-----+
| ID|start_date|             name| type|
+---+----------+-----------------+-----+
|  1|2020/01/01|cheese,meat,fruit|A,B,C|
+---+----------+-----------------+-----+
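The result above lists the values in first-seen order, unlike the set-based answers. A plain-Python model of flattening followed by an order-preserving dedupe is sketched below. It is not Spark code, flatten_distinct is a hypothetical helper, and it assumes the collected rows arrive in their original order, which collect_list does not guarantee after a shuffle.

```python
from itertools import chain

# Each inner list models one row's split result, as gathered per group.
collected_names = [["cheese", "meat"], ["cheese", "fruit"]]
collected_types = [["A", "B"], ["A", "C"]]

def flatten_distinct(arrays):
    """Flatten nested lists, then keep the first occurrence of each value."""
    flat = chain.from_iterable(arrays)
    return list(dict.fromkeys(flat))  # dicts preserve insertion order

print(",".join(flatten_distinct(collected_names)))  # cheese,meat,fruit
print(",".join(flatten_distinct(collected_types)))  # A,B,C
```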