Convert a DataFrame to nested JSON output

Time: 2021-04-20 08:05:58

Tags: json dataframe apache-spark apache-spark-sql

I have a DataFrame derived from my data that looks like this:

id  identifier  actual_cost  cost_incurred  timestamp
1   abc123      24           21             2021-04-16T19:07:00
2   xyz987      12           34             2021-04-16T19:25:27
2   xyz987      92           87             2021-04-16T19:32:43
1   abc123      37           39             2021-04-16T19:26:30
3   abc567      87           85             2021-04-16T19:13:00

My requirement is that the final dump file should contain the whole DataFrame as nested JSON like this:

 {
"hits": [
    {
        "id": 1,
        "identifier": "abc123",
        "cost": [
            {
                "actual_cost": 24,
                "cost_incurred": 21,
                "timestamp": "2021-04-16T19:07:00"
            },
            {
                "actual_cost": 37,
                "cost_incurred": 39,
                "timestamp": "2021-04-16T19:26:30"
            }
        ]
    },
    {
        "id": 2,
        "identifier": "xyz987",
        "cost": [
            {
                "actual_cost": 12,
                "cost_incurred": 34,
                "timestamp": "2021-04-16T19:25:27"
            },
            {
                "actual_cost": 92,
                "cost_incurred": 87,
                "timestamp": "2021-04-16T19:32:43"
            }
        ]
    },
    {
        "id": 3,
        "identifier": "abc567",
        "cost": [
            {
                "actual_cost": 87,
                "cost_incurred": 85,
                "timestamp": "2021-04-16T19:13:00"
            }
        ]
    }
]
}

I have been looking at the map function but cannot figure out how to group the results. Any pointers or solutions would be appreciated.

2 Answers:

Answer 0: (score: 0)

to_json will be your friend :) along with some grouping and aggregation:

df.createOrReplaceTempView("df")

result = spark.sql("""
    select 
        to_json(struct(collect_list(item) hits)) result 
    from (
        select 
            struct(
                id, identifier, collect_list(struct(actual_cost, cost_incurred, timestamp)) cost
            ) item 
        from df 
        group by id, identifier
    )
""")

result.show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"hits":[{"id":"2","identifier":"xyz987","cost":[{"actual_cost":"12","cost_incurred":"34","timestamp":"2021-04-16T19:25:27"},{"actual_cost":"92","cost_incurred":"87","timestamp":"2021-04-16T19:32:43"}]},{"id":"1","identifier":"abc123","cost":[{"actual_cost":"24","cost_incurred":"21","timestamp":"2021-04-16T19:07:00"},{"actual_cost":"37","cost_incurred":"39","timestamp":"2021-04-16T19:26:30"}]},{"id":"3","identifier":"abc567","cost":[{"actual_cost":"87","cost_incurred":"85","timestamp":"2021-04-16T19:13:00"}]}]}|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
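To see what the query computes without a Spark session, the same two-level aggregation can be sketched in plain Python on the sample rows from the question. This is only an illustration of the grouping logic, not a replacement for the Spark query; the function name `nest_hits` and the `rows` list are hypothetical. Note that this sketch keeps groups in first-appearance order, whereas Spark does not guarantee the order of groups or of the elements returned by collect_list.

```python
import json

# Sample rows from the question: (id, identifier, actual_cost, cost_incurred, timestamp)
rows = [
    (1, "abc123", 24, 21, "2021-04-16T19:07:00"),
    (2, "xyz987", 12, 34, "2021-04-16T19:25:27"),
    (2, "xyz987", 92, 87, "2021-04-16T19:32:43"),
    (1, "abc123", 37, 39, "2021-04-16T19:26:30"),
    (3, "abc567", 87, 85, "2021-04-16T19:13:00"),
]


def nest_hits(rows):
    """Group rows by (id, identifier) and nest the cost fields (hypothetical helper)."""
    groups = {}  # dicts keep insertion order, so groups appear in first-seen order
    for id_, identifier, actual_cost, cost_incurred, timestamp in rows:
        # Inner step: one item per (id, identifier), like the GROUP BY subquery
        item = groups.setdefault(
            (id_, identifier),
            {"id": id_, "identifier": identifier, "cost": []},
        )
        # Equivalent of collect_list(struct(actual_cost, cost_incurred, timestamp))
        item["cost"].append(
            {
                "actual_cost": actual_cost,
                "cost_incurred": cost_incurred,
                "timestamp": timestamp,
            }
        )
    # Outer step: collect all items under a single "hits" key
    return {"hits": list(groups.values())}


nested = nest_hits(rows)
print(json.dumps(nested, indent=4))
```

The inner loop plays the role of the grouped subquery, and the final return statement corresponds to the outer to_json(struct(collect_list(item) hits)).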

Answer 1: (score: 0)

Here is how to do it using groupBy, an aggregation, and toJSON:

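import org.apache.spark.sql.functions.{col, collect_list, struct, to_json}

// Group by id and identifier, collecting each row's cost fields into an array of structs
val resultDf = df.groupBy("id", "identifier")
  .agg(collect_list(struct("actual_cost", "cost_incurred", "timestamp")) as "cost")
  .toJSON // one JSON string per group
resultDf.show(false)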

Result:

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"id":2,"identifier":"xyz987","cost":[{"actual_cost":12,"cost_incurred":34,"timestamp":"2021-04-16T19:25:27"},{"actual_cost":92,"cost_incurred":87,"timestamp":"2021-04-16T19:32:43"}]}|
|{"id":1,"identifier":"abc123","cost":[{"actual_cost":24,"cost_incurred":21,"timestamp":"2021-04-16T19:07:00"},{"actual_cost":37,"cost_incurred":39,"timestamp":"2021-04-16T19:26:30"}]}|
|{"id":3,"identifier":"abc567","cost":[{"actual_cost":87,"cost_incurred":85,"timestamp":"2021-04-16T19:13:00"}]}                                                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

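If you want everything in a single row, aggregate the grouped DataFrame before calling toJSON (named result here):

// `result` is the groupBy/agg DataFrame from above, without the final .toJSON
result.agg(to_json(collect_list(struct(result.columns.map(col): _*))).as("hits"))
  .show(false)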

Result:

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|hits                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"id":2,"identifier":"xyz987","cost":[{"actual_cost":12,"cost_incurred":34,"timestamp":"2021-04-16T19:25:27"},{"actual_cost":92,"cost_incurred":87,"timestamp":"2021-04-16T19:32:43"}]},{"id":1,"identifier":"abc123","cost":[{"actual_cost":24,"cost_incurred":21,"timestamp":"2021-04-16T19:07:00"},{"actual_cost":37,"cost_incurred":39,"timestamp":"2021-04-16T19:26:30"}]},{"id":3,"identifier":"abc567","cost":[{"actual_cost":87,"cost_incurred":85,"timestamp":"2021-04-16T19:13:00"}]}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
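The single-row result above is a bare JSON array in a column named hits, while the question asks for an object with a top-level "hits" key. If that string has already been collected to the driver, one way to wrap it is sketched below in plain Python. The helper name `wrap_hits` and the `sample` string are hypothetical, and the sketch assumes the result is small enough to hold in memory.

```python
import json


def wrap_hits(array_json):
    """Wrap a JSON array string under a top-level "hits" key (hypothetical helper)."""
    hits = json.loads(array_json)
    if not isinstance(hits, list):
        raise ValueError("expected a JSON array of hit objects")
    return json.dumps({"hits": hits}, indent=4)


# Shortened stand-in for the string held in the single-row "hits" column
sample = (
    '[{"id":3,"identifier":"abc567","cost":'
    '[{"actual_cost":87,"cost_incurred":85,"timestamp":"2021-04-16T19:13:00"}]}]'
)

wrapped = wrap_hits(sample)
print(wrapped)
```

The wrapped string can then be written to the dump file with an ordinary file write.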