I have a dataframe derived from my data that gives me something like this:
id | identifier | actual_cost | cost_incurred | timestamp |
---|---|---|---|---|
1 | abc123 | 24 | 21 | 2021-04-16T19:07:00 |
2 | xyz987 | 12 | 34 | 2021-04-16T19:25:27 |
2 | xyz987 | 92 | 87 | 2021-04-16T19:32:43 |
1 | abc123 | 37 | 39 | 2021-04-16T19:26:30 |
3 | abc567 | 87 | 85 | 2021-04-16T19:13:00 |
My requirement is that the final dump file should contain the whole dataframe as nested JSON, like this:
{
"hits": [
{
"id": 1,
"identifier": "abc123",
"cost": [
{
"actual_cost": 24,
"cost_incurred": 21,
"timestamp": "2021-04-16T19:07:00"
},
{
"actual_cost": 37,
"cost_incurred": 39,
"timestamp": "2021-04-16T19:26:30"
}
]
},
{
"id": 2,
"identifier": "xyz987",
"cost": [
{
"actual_cost": 12,
"cost_incurred": 34,
"timestamp": "2021-04-16T19:25:27"
},
{
"actual_cost": 37,
"cost_incurred": 39,
"timestamp": "2021-04-16T19:26:30"
}
]
},
{
"id": 3,
"identifier": "abc567",
"cost": [
{
"actual_cost": 87,
"cost_incurred": 85,
"timestamp": "2021-04-16T19:13:00"
}
]
}
]
}
I was looking at the map function but could not figure out a way to group the results. Any clues or solutions would be much appreciated.
Answer 0 (score: 0)
to_json will be your friend :) along with some grouping and aggregation:
df.createOrReplaceTempView("df")
result = spark.sql("""
select
to_json(struct(collect_list(item) hits)) result
from (
select
struct(
id, identifier, collect_list(struct(actual_cost, cost_incurred, timestamp)) cost
) item
from df
group by id, identifier
)
""")
result.show()
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"hits":[{"id":"2","identifier":"xyz987","cost":[{"actual_cost":"12","cost_incurred":"34","timestamp":"2021-04-16T19:25:27"},{"actual_cost":"92","cost_incurred":"87","timestamp":"2021-04-16T19:32:43"}]},{"id":"1","identifier":"abc123","cost":[{"actual_cost":"24","cost_incurred":"21","timestamp":"2021-04-16T19:07:00"},{"actual_cost":"37","cost_incurred":"39","timestamp":"2021-04-16T19:26:30"}]},{"id":"3","identifier":"abc567","cost":[{"actual_cost":"87","cost_incurred":"85","timestamp":"2021-04-16T19:13:00"}]}]}|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
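For readers who want to see the grouping step itself (the part the question was stuck on), the same nesting can be sketched in plain Python, outside Spark. This is a minimal sketch, not the Spark answer above: `rows` is a hypothetical list of dicts standing in for the dataframe's rows.

```python
import json
from itertools import groupby

# Hypothetical rows standing in for the dataframe in the question.
rows = [
    {"id": 1, "identifier": "abc123", "actual_cost": 24, "cost_incurred": 21, "timestamp": "2021-04-16T19:07:00"},
    {"id": 2, "identifier": "xyz987", "actual_cost": 12, "cost_incurred": 34, "timestamp": "2021-04-16T19:25:27"},
    {"id": 2, "identifier": "xyz987", "actual_cost": 92, "cost_incurred": 87, "timestamp": "2021-04-16T19:32:43"},
    {"id": 1, "identifier": "abc123", "actual_cost": 37, "cost_incurred": 39, "timestamp": "2021-04-16T19:26:30"},
    {"id": 3, "identifier": "abc567", "actual_cost": 87, "cost_incurred": 85, "timestamp": "2021-04-16T19:13:00"},
]

# groupby only merges adjacent rows, so sort by the grouping key first,
# then collect each group's cost entries into a nested list.
key = lambda r: (r["id"], r["identifier"])
hits = []
for (rid, ident), group in groupby(sorted(rows, key=key), key=key):
    costs = [
        {"actual_cost": r["actual_cost"],
         "cost_incurred": r["cost_incurred"],
         "timestamp": r["timestamp"]}
        for r in group
    ]
    hits.append({"id": rid, "identifier": ident, "cost": costs})

print(json.dumps({"hits": hits}, indent=2))
```

This mirrors what collect_list(struct(...)) with group by does in the SQL above: the sort-then-group pass plays the role of group by, and the inner list comprehension plays the role of collect_list.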
Answer 1 (score: 0)
Here is how to do it using groupBy, some aggregation, and toJSON:
import org.apache.spark.sql.functions._

val resultDf = df.groupBy("id", "identifier")
  .agg(collect_list(struct("actual_cost", "cost_incurred", "timestamp")) as "cost")
  .toJSON
resultDf.show(false)
Result:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"id":2,"identifier":"xyz987","cost":[{"actual_cost":12,"cost_incurred":34,"timestamp":"2021-04-16T19:25:27"},{"actual_cost":92,"cost_incurred":87,"timestamp":"2021-04-16T19:32:43"}]}|
|{"id":1,"identifier":"abc123","cost":[{"actual_cost":24,"cost_incurred":21,"timestamp":"2021-04-16T19:07:00"},{"actual_cost":37,"cost_incurred":39,"timestamp":"2021-04-16T19:26:30"}]}|
|{"id":3,"identifier":"abc567","cost":[{"actual_cost":87,"cost_incurred":85,"timestamp":"2021-04-16T19:13:00"}]} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
If you want it all in one row, aggregate the grouped DataFrame into a single array before converting to JSON:
grouped.agg(to_json(collect_list(struct(grouped.columns.map(col): _*))).as("hits"))
  .show(false)
where grouped is the DataFrame after groupBy/agg but before toJSON:
val grouped = df.groupBy("id", "identifier")
  .agg(collect_list(struct("actual_cost", "cost_incurred", "timestamp")) as "cost")
Result:
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|hits |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"id":2,"identifier":"xyz987","cost":[{"actual_cost":12,"cost_incurred":34,"timestamp":"2021-04-16T19:25:27"},{"actual_cost":92,"cost_incurred":87,"timestamp":"2021-04-16T19:32:43"}]},{"id":1,"identifier":"abc123","cost":[{"actual_cost":24,"cost_incurred":21,"timestamp":"2021-04-16T19:07:00"},{"actual_cost":37,"cost_incurred":39,"timestamp":"2021-04-16T19:26:30"}]},{"id":3,"identifier":"abc567","cost":[{"actual_cost":87,"cost_incurred":85,"timestamp":"2021-04-16T19:13:00"}]}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
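The question asks for a final dump file, and both answers stop at show(). One way to finish the job, assuming the aggregated result is a single row that fits in driver memory, is to collect the JSON string and write it with plain Python. This is a sketch: the literal `json_str` below is a stand-in for the string you would collect from Spark (e.g. via `result.head()[0]`), and the output path is illustrative.

```python
import json
import os
import tempfile

# Stand-in for the single JSON string collected from Spark, e.g.:
#   json_str = result.head()[0]   # requires a live SparkSession
json_str = '{"hits": [{"id": 3, "identifier": "abc567", "cost": []}]}'

# Parse first so a malformed string fails loudly before touching disk,
# then write the dump file (path is illustrative).
parsed = json.loads(json_str)
path = os.path.join(tempfile.gettempdir(), "dump.json")
with open(path, "w") as f:
    json.dump(parsed, f, indent=2)
```

Collecting to the driver is fine here because the aggregation already reduced everything to one row; for genuinely large output you would instead write from Spark itself (e.g. a text write of the result DataFrame).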