I have a DataFrame, unionDataDF, with the following sample data:
+---+------------------+----+
| id|              data| key|
+---+------------------+----+
|  1|[{"data":"data1"}]|key1|
|  2|[{"data":"data2"}]|key1|
|  1|[{"data":"data1"}]|key2|
|  2|[{"data":"data2"}]|key2|
+---+------------------+----+
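Here id is an IntegerType column, data holds JSON (an array of JSON strings, per the code below), and key is a StringType column.
I want to send the data over the network for each id. For example, the output for id 1 should look like this: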
{
  "id": 1,
  "data": {
    "key1": [{
      "data": "data1"
    }],
    "key2": [{
      "data": "data1"
    }]
  }
}
How can I do this?
Sample code to create unionDataDF:
val dummyDataDF = Seq((1, "data1"), (2, "data2")).toDF("id", "data")
val key1JsonDataDF = dummyDataDF.withColumn("data", to_json(struct( $"data"))).groupBy("id").agg(collect_list($"data").alias("data")).withColumn("key", lit("key1"))
val key2JsonDataDF = dummyDataDF.withColumn("data", to_json(struct( $"data"))).groupBy("id").agg(collect_list($"data").alias("data")).withColumn("key", lit("key2"))
val unionDataDF = key1JsonDataDF.union(key2JsonDataDF)
Versions:
Spark: 2.2
Scala: 2.11
Answer 0 (score: 0)
Something like this:
unionDataDF
.groupBy("id")
.agg(collect_list(struct("key", "data")).alias("grouped"))
.show(10, false)
Output:
+---+--------------------------------------------------------+
|id |grouped |
+---+--------------------------------------------------------+
|1 |[[key1, [{"data":"data1"}]], [key2, [{"data":"data1"}]]]|
|2 |[[key1, [{"data":"data2"}]], [key2, [{"data":"data2"}]]]|
+---+--------------------------------------------------------+
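The grouped result above is still an array of (key, data) pairs rather than the nested JSON document the question asks for. One way to get there on Spark 2.2 is to build the JSON text with plain string functions, since the data column already holds JSON strings. The following is a minimal sketch, not a definitive implementation. It assumes the key values need no JSON escaping and that every element of data is already valid JSON; the names entry and perIdJson are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("per-id-json").getOrCreate()
import spark.implicits._

// Rebuild unionDataDF exactly as in the question.
val dummyDataDF = Seq((1, "data1"), (2, "data2")).toDF("id", "data")
val key1JsonDataDF = dummyDataDF.withColumn("data", to_json(struct($"data")))
  .groupBy("id").agg(collect_list($"data").alias("data")).withColumn("key", lit("key1"))
val key2JsonDataDF = dummyDataDF.withColumn("data", to_json(struct($"data")))
  .groupBy("id").agg(collect_list($"data").alias("data")).withColumn("key", lit("key2"))
val unionDataDF = key1JsonDataDF.union(key2JsonDataDF)

// Render each row as one JSON object member, e.g. "key1":[{"data":"data1"}].
// The JSON strings in `data` are spliced in verbatim, not re-escaped.
val entry = concat(lit("\""), $"key", lit("\":["), concat_ws(",", $"data"), lit("]"))

// Collect the members per id (sorted so the key order is deterministic)
// and wrap them in the outer {"id":...,"data":{...}} document.
val perIdJson = unionDataDF
  .groupBy("id")
  .agg(sort_array(collect_list(entry)).alias("entries"))
  .select(concat(
    lit("{\"id\":"), $"id".cast("string"),
    lit(",\"data\":{"), concat_ws(",", $"entries"), lit("}}")
  ).alias("json"))

perIdJson.show(false)
```

For id 1 this should produce the string {"id":1,"data":{"key1":[{"data":"data1"}],"key2":[{"data":"data1"}]}}. Each row of perIdJson is then a single string that can be handed to whatever client sends it over the network. String concatenation is used here because the data values are JSON text; passing them through to_json again would escape them as ordinary strings instead of nesting them.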