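I have a DataFrame df4 with three columns: id, which identifies an entity; data, which holds an array of JSON strings; and executor_id, a string value. The code to create it is as follows: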
val df1 = Seq((1, "n1", "d1")).toDF("id", "number", "data")
val df2 = df1.withColumn("data", to_json(struct($"number", $"data"))).groupBy("id").agg(collect_list($"data").alias("data")).withColumn("executor_id", lit("e1"))
val df3 = df1.withColumn("data", to_json(struct($"number", $"data"))).groupBy("id").agg(collect_list($"data").alias("data")).withColumn("executor_id", lit("e2"))
val df4 = df2.union(df3)
The content of df4 looks like this:
scala> df4.show(false)
+---+-----------------------------+-----------+
|id |data |executor_id|
+---+-----------------------------+-----------+
|1 |[{"number":"n1","data":"d1"}]|e1 |
|1 |[{"number":"n1","data":"d1"}]|e2 |
+---+-----------------------------+-----------+
I have to create new JSON data for each id, with executor_id as the key and data as the value. The resulting DataFrame should look like this:
+---+------------------------------------------------------------------------+
|id |new_data |
+---+------------------------------------------------------------------------+
|1 |{"e1":[{"number":"n1","data":"d1"}], "e2":[{"number":"n1","data":"d1"}]}|
+---+------------------------------------------------------------------------+
Versions:
Spark: 2.2
Scala: 2.11
Answer 0 (score: 0)
I have been struggling with this problem for the past three days and was finally able to solve it with a UserDefinedAggregateFunction. Here is the sample code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

class CustomAggregator extends UserDefinedAggregateFunction {
  // Input columns: the key (executor_id) and the array of JSON strings (data).
  override def inputSchema: StructType =
    StructType(Array(StructField("key", StringType), StructField("value", ArrayType(StringType))))
  // These are the internal fields kept while computing the aggregate.
  override def bufferSchema: StructType = StructType(
    Array(StructField("mapData", MapType(StringType, ArrayType(StringType))))
  )
  // This is the output type of the aggregation function.
  override def dataType: DataType = StringType
  override def deterministic: Boolean = true
  // This is the initial value of the buffer; its type matches bufferSchema.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Map.empty[String, Seq[String]]
  }
  // This is how to update the buffer given an input row (key -> array of JSON strings).
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getMap[String, Seq[String]](0) + (input.getAs[String](0) -> input.getAs[Seq[String]](1))
  }
  // This is how to merge two buffers of the bufferSchema type.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, buffer1.getAs[Map[String, Any]](0) ++ buffer2.getAs[Map[String, Any]](0))
  }
  // This is where the final value is produced from the final state of the buffer.
  override def evaluate(buffer: Row): Any = {
    val map = buffer(0).asInstanceOf[Map[Any, Any]]
    val buff: ListBuffer[String] = ListBuffer()
    for ((k, v) <- map) {
      // Each value is an array of strings that are already serialized JSON.
      val valArray = v.asInstanceOf[mutable.WrappedArray[Any]].array
      val tmp = valArray.map(_.toString).toList.mkString(",")
      buff += "\"" + k.toString + "\":[" + tmp + "]"
    }
    "{" + buff.toList.mkString(",") + "}"
  }
}
Now use the CustomAggregator:
val ca = new CustomAggregator
val df5 = df4.groupBy("id").agg(ca(col("executor_id"), col("data")).as("jsonData"))
The resulting DataFrame is:
scala> df5.show(false)
+---+-----------------------------------------------------------------------+
|id |jsonData |
+---+-----------------------------------------------------------------------+
|1 |{"e1":[{"number":"n1","data":"d1"}],"e2":[{"number":"n1","data":"d1"}]}|
+---+-----------------------------------------------------------------------+
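One thing to keep in mind about update and merge: `+` and `++` on a Scala Map replace the value when the key already exists. That is fine for the sample data, where each executor_id appears once per id, but if the same executor_id could show up in several rows of one group, only one of its arrays would survive. Below is a minimal plain-Scala sketch of a merge that concatenates instead. MapMerge and mergeConcat are made-up names for illustration and are not part of the aggregator.

```scala
// Hypothetical helper, not part of the aggregator above.
object MapMerge {
  // Merges two maps; for keys present in both, the lists are concatenated
  // (left entries first) rather than the right value replacing the left one.
  def mergeConcat(
      left: Map[String, Seq[String]],
      right: Map[String, Seq[String]]): Map[String, Seq[String]] =
    right.foldLeft(left) { case (acc, (key, docs)) =>
      acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ docs)
    }
}
```

Whether concatenating is the right behaviour depends on the data; if an executor_id is guaranteed to be unique within an id, the plain `++` is enough.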
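Even though this solves the problem, I am not sure whether it is the right way to do it. What makes me doubt it is the use of Any in several places; I don't think that is correct.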
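One possible way to get rid of the Any casts in evaluate, as a rough sketch only: keep the string assembly in a small typed function and pass it a typed view of the buffer. As far as I can tell, Row offers getMap[K, V], so something like buffer.getMap[String, Seq[String]](0) should provide that view, though I have not checked this against every Spark version. JsonRender, escapeKey and render are names I made up for this example.

```scala
// Hypothetical helper: renders the aggregated map as a JSON object string
// using concrete types only.
object JsonRender {
  // Escapes backslashes and double quotes so the key remains a valid JSON string.
  def escapeKey(key: String): String =
    key.replace("\\", "\\\\").replace("\"", "\\\"")

  // Each value element is assumed to already be serialized JSON
  // (for example the output of to_json), so it is inserted verbatim.
  def render(entries: scala.collection.Map[String, Seq[String]]): String =
    entries.toSeq
      .sortBy(_._1) // sort by key so the output order is deterministic
      .map { case (key, docs) =>
        "\"" + escapeKey(key) + "\":[" + docs.mkString(",") + "]"
      }
      .mkString("{", ",", "}")
}
```

With the sample data this yields the same string as the jsonData column. Sorting the keys is an extra choice of mine so that the output does not depend on map iteration order.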