Group child objects into a JSON array

Time: 2019-04-02 00:46:46

Tags: apache-spark pyspark

I have two datasets:

  1. Users
Id, Name
1, Jack
2, Jill
3, James
  2. Activities
Id, Activity, UserId
101, Activity 1, 1
102, Activity 2, 1
201, Activity 3, 2
301, Activity 4, 3

How can I use PySpark to add a column named "Activities" to the Users dataset that groups all of a user's related activities in JSON format? The expected output is:

Id, Name, Activities
1, Jack, [{Id: 101, Name: Activity 1}, {Id: 102, Name: Activity 2}]
2, Jill, [{Id: 201, Name: Activity 3}]
3, James, [{Id: 301, Name: Activity 4}]

2 Answers:

Answer 0 (score: 2)

Combining non-JSON and JSON data can be a bit tricky. The solution below builds a JSON structure from all of the columns (including Id and Name), so its end result comes very close to what you're asking for.

First, let's create the sample data:

list1 = [[1, "Jack"], [2, "Jill"], [3, "James"]]
df1 = spark.createDataFrame(list1, schema=["id", "Name"])

list2 = [[101, "Activity1", 1], [102, "Activity2", 1], [201, "Activity3", 2], [301, "Activity4", 3]]
df2 = spark.createDataFrame(list2, schema=["Id", "Activity", "UserId"])

Then register both dataframes as temporary views, so we can run SQL on them to shape the data the way we want:

df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")

Then run a SQL query that uses a combination of collect_list and named_struct to closely match the required final structure:

df3= spark.sql("""
    WITH tmp 
     AS (SELECT t1.id, 
                Collect_list(Named_struct("id", t2.id, "name", t2.activity)) AS 
                   Activities 
         FROM   table1 t1 
                JOIN table2 t2 
                  ON ( t1.id = t2.userid ) 
         GROUP  BY t1.id) 
    SELECT tmp.id, 
           t3.NAME, 
           tmp.activities 
    FROM   tmp 
           JOIN table1 t3 
             ON ( tmp.id = t3.id ) 
        """)

df3.toJSON().collect()

This gives me the following result:

['{"id":1,"NAME":"Jack","activities":[{"id":101,"name":"Activity1"},{"id":102,"name":"Activity2"}]}',
 '{"id":3,"NAME":"James","activities":[{"id":301,"name":"Activity4"}]}',
 '{"id":2,"NAME":"Jill","activities":[{"id":201,"name":"Activity3"}]}']

If I remove the toJSON() conversion and just show the result, it displays as:

 df3.show() 


+---+-----+------------------------------------+
| id| NAME|                          activities|
+---+-----+------------------------------------+
|  1| Jack|[[101, Activity1], [102, Activity2]]|
|  3|James|                  [[301, Activity4]]|
|  2| Jill|                  [[201, Activity3]]|
+---+-----+------------------------------------+

Answer 1 (score: 1)

Assuming you have two dataframes, dfUser and dfActivities:


from pyspark.sql.functions import col, collect_list, struct

acts = dfActivities.withColumnRenamed("Id", "aId")
joinDf = (dfUser.join(acts, dfUser["Id"] == acts["UserId"])
                .groupBy(dfUser["Id"], "Name")
                .agg(collect_list(struct(col("aId").alias("Id"),
                                         col("Activity").alias("Name")))
                     .alias("Activities")))