Create nested JSON from all rows with the same ID: DataFrame

Date: 2019-02-13 13:19:44

Tags: json scala apache-spark

I have a DataFrame df4 with three columns:

  1. id of the commented entity
  2. data holding JSON array data
  3. executor_id as a string value

The code used to create it is as follows:

import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq((1, "n1", "d1")).toDF("id", "number", "data")

val df2 = df1.withColumn("data", to_json(struct($"number", $"data")))
  .groupBy("id").agg(collect_list($"data").alias("data"))
  .withColumn("executor_id", lit("e1"))

val df3 = df1.withColumn("data", to_json(struct($"number", $"data")))
  .groupBy("id").agg(collect_list($"data").alias("data"))
  .withColumn("executor_id", lit("e2"))

val df4 = df2.union(df3)

The contents of df4 look like this:

scala> df4.show(false)
+---+-----------------------------+-----------+
|id |data                         |executor_id|
+---+-----------------------------+-----------+
|1  |[{"number":"n1","data":"d1"}]|e1         |
|1  |[{"number":"n1","data":"d1"}]|e2         |
+---+-----------------------------+-----------+

I have to create new JSON data with executor_id as the key and data as the value, aggregated per id. The resulting DataFrame should look like this:

+---+------------------------------------------------------------------------+
|id |new_data                                                                |
+---+------------------------------------------------------------------------+
|1  |{"e1":[{"number":"n1","data":"d1"}], "e2":[{"number":"n1","data":"d1"}]}|
+---+------------------------------------------------------------------------+

Versions:

Spark: 2.2
Scala: 2.11

1 answer:

Answer 0 (score: 0)

I have been struggling with this problem for the past three days, and was finally able to solve it using a UserDefinedAggregateFunction. Here is the sample code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

import scala.collection.mutable
import scala.collection.mutable.ListBuffer

class CustomAggregator extends UserDefinedAggregateFunction {
  override def inputSchema: org.apache.spark.sql.types.StructType =
    StructType(Array(StructField("key", StringType), StructField("value", ArrayType(StringType))))

  // These are the internal fields you keep for computing your aggregate
  override def bufferSchema: StructType = StructType(
    Array(StructField("mapData", MapType(StringType, ArrayType(StringType))))
  )

  // This is the output type of your aggregation function.
  override def dataType = StringType

  override def deterministic: Boolean = true

  // This is the initial value for your buffer schema.
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = scala.collection.mutable.Map[String, Seq[String]]()
  }

  // This is how to update your buffer schema given an input.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getMap[String, Seq[String]](0) + (input.getAs[String](0) -> input.getAs[Seq[String]](1))
  }

  // This is how to merge two objects with the bufferSchema type.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0, buffer1.getAs[Map[String, Seq[String]]](0) ++ buffer2.getAs[Map[String, Seq[String]]](0))
  }

  // This is where you output the final value, given the final value of your bufferSchema.
  override def evaluate(buffer: Row): Any = {
    val map = buffer(0).asInstanceOf[Map[Any, Any]]
    val buff: ListBuffer[String] = ListBuffer()
    for ((k, v) <- map) {
      val valArray = v.asInstanceOf[mutable.WrappedArray[Any]].array
      val tmp = {
        for {
          valString <- valArray
        } yield valString.toString
      }.toList.mkString(",")
      buff += "\"" + k.toString + "\":[" + tmp + "]"
    }
    "{" + buff.toList.mkString(",") + "}"
  }
}

Now, using the custom aggregator:

val ca = new CustomAggregator
val df5 = df4.groupBy("id").agg(ca(col("executor_id"), col("data")).as("jsonData"))
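
If you also need the aggregation from Spark SQL, the same class can be registered under a name; a small sketch (the function name customAgg and the view name are just illustrative):

spark.udf.register("customAgg", new CustomAggregator)
df4.createOrReplaceTempView("df4")
spark.sql("SELECT id, customAgg(executor_id, data) AS jsonData FROM df4 GROUP BY id").show(false)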

The resulting DataFrame is:

scala> df5.show(false)
+---+-----------------------------------------------------------------------+
|id |jsonData                                                               |
+---+-----------------------------------------------------------------------+
|1  |{"e1":[{"number":"n1","data":"d1"}],"e2":[{"number":"n1","data":"d1"}]}|
+---+-----------------------------------------------------------------------+

Even though I have solved the problem, I am not sure whether this is the right approach. My reasons for doubt are:

  1. In some places I have used Any. I don't think that is correct (one possible alternative is sketched after this list).
  2. For every evaluate call I am creating a ListBuffer and many other data types. I am not sure about the performance of this code.
  3. I still have to test this code with many kinds of data, e.g. double, date type, nested JSON, etc.
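
For reference, here is a minimal sketch of one way to avoid both the custom aggregator and the Any casts, assuming (as above) that data already holds valid JSON strings; buildJson and df6 are just illustrative names:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Collect the (executor_id, data) pairs per id with a built-in aggregate,
// then assemble the JSON text in a plain UDF with concrete types.
val buildJson = udf { entries: Seq[Row] =>
  entries
    .map { e =>
      val key   = e.getString(0)        // executor_id
      val jsons = e.getSeq[String](1)   // the collected JSON object strings
      "\"" + key + "\":[" + jsons.mkString(",") + "]"
    }
    .mkString("{", ",", "}")
}

val df6 = df4
  .groupBy("id")
  .agg(collect_list(struct($"executor_id", $"data")).as("entries"))
  .select($"id", buildJson($"entries").as("new_data"))

This keeps the per-group work in the built-in collect_list and leaves only the string assembly to the UDF. Note that a plain to_json does not help here: the strings in data are already serialized JSON, so serializing them again would escape the quotes.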