I have two datasets:
Id, Name
1, Jack
2, Jill
3, James
Id, Activity, UserId
101, Activity 1, 1
102, Activity 2, 1
201, Activity 3, 2
301, Activity 4, 3
How can I use PySpark to add a column named "Activities" to the users dataset that groups all of a user's related activities in JSON format? The expected output is:
Id, Name, Activities
1, Jack, [{Id: 101, Name: Activity 1}, {Id: 102, Name: Activity 2}]
2, Jill, [{Id: 201, Name: Activity 3}]
3, James, [{Id: 301, Name: Activity 4}]
Answer 0 (score: 2)
Mixing non-JSON and JSON data together can be a bit tricky. The solution below builds a JSON structure out of all the columns (including Id and Name), so its end result is very close to the expected output.
First, let's create the sample data:
list1 = [(1, "Jack"), (2, "Jill"), (3, "James")]
df1 = spark.createDataFrame(list1, schema=["id", "Name"])
list2 = [(101, "Activity1", 1), (102, "Activity2", 1), (201, "Activity3", 2), (301, "Activity4", 3)]
df2 = spark.createDataFrame(list2, schema=["Id", "Activity", "UserId"])
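As a quick sanity check, printSchema() should confirm the types Spark inferred from the Python values (integers become long, strings stay string):

df2.printSchema()
# root
#  |-- Id: long (nullable = true)
#  |-- Activity: string (nullable = true)
#  |-- UserId: long (nullable = true)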
Then register both dataframes as temp views, so we can run SQL on them to shape the data the way we need:
df1.createOrReplaceTempView("table1")  # registerTempTable() is the older, deprecated spelling
df2.createOrReplaceTempView("table2")
Then run a SQL query that uses a combination of collect_list and named_struct to closely match the required final structure:
df3 = spark.sql("""
    WITH tmp AS (
        SELECT t1.id,
               collect_list(named_struct("id", t2.id, "name", t2.activity)) AS activities
        FROM table1 t1
        JOIN table2 t2 ON (t1.id = t2.userid)
        GROUP BY t1.id
    )
    SELECT tmp.id,
           t3.NAME,
           tmp.activities
    FROM tmp
    JOIN table1 t3 ON (tmp.id = t3.id)
""")
df3.toJSON().collect()
This gives me the following result:
['{"id":1,"NAME":"Jack","activities":[{"id":101,"name":"Activity1"},{"id":101,"name":"Activity2"}]}',
'{"id":3,"NAME":"James","activities":[{"id":301,"name":"Activity4"}]}',
'{"id":2,"NAME":"Jill","activities":[{"id":201,"name":"Activity3"}]}']
If I drop the toJSON() conversion and just show the result, it displays as:

df3.show(truncate=False)

+---+-----+------------------------------------+
|id |NAME |activities                          |
+---+-----+------------------------------------+
|1  |Jack |[[101, Activity1], [102, Activity2]]|
|3  |James|[[301, Activity4]]                  |
|2  |Jill |[[201, Activity3]]                  |
+---+-----+------------------------------------+
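The question's expected output keeps Id and Name as plain columns and only Activities as JSON. To get closer to that, one option (a sketch, not part of the original answer; to_json accepts arrays of structs on Spark 2.4+) is to serialize just the activities column:

from pyspark.sql.functions import col, to_json

# Turn only the activities array into a JSON string; id and NAME stay plain columns
df3.withColumn("activities", to_json(col("activities"))).show(truncate=False)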
Answer 1 (score: 1)
Assuming you have two dataframes, dfUser and dfActivities:
from pyspark.sql.functions import collect_list, struct

# Rename the activity Id up front so it does not clash with the user Id after the join
dfAct = dfActivities.withColumnRenamed("Id", "aId")
joinDf = (dfUser.join(dfAct, dfUser["Id"] == dfAct["UserId"])
          .groupBy(dfUser["Id"], "Name")
          .agg(collect_list(struct(dfAct["aId"].alias("Id"),
                                   dfAct["Activity"].alias("Name"))).alias("Activities")))
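Assuming dfUser and dfActivities hold the sample data from the question, collecting the result as JSON (as in the answer above) should give something close to the expected output:

joinDf.toJSON().collect()
# e.g. '{"Id":1,"Name":"Jack","Activities":[{"Id":101,"Name":"Activity 1"},{"Id":102,"Name":"Activity 2"}]}'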