Passing a struct to a UDAF in Spark

Time: 2019-02-04 14:17:40

Tags: scala apache-spark hadoop apache-spark-sql user-defined-functions

I have the following schema:

root
 |-- id: string (nullable = false)
 |-- age: long (nullable = true)
 |-- cars: struct (nullable = true)
 |    |-- car1: string (nullable = true)
 |    |-- car2: string (nullable = true)
 |    |-- car3: string (nullable = true)
 |-- name: string (nullable = true)

How can I pass the struct `cars` to a UDAF? What should `inputSchema` be if I only want to pass the `cars` sub-struct?

1 answer:

Answer 0 (score: 2)

You can, but the logic of your UDAF will change. For example, suppose you have two rows:

// case classes matching the schema printed below
case class cars_schema(car1: String, car2: String, car3: String)
case class cars(cars: cars_schema)

val seq = Seq(cars(cars_schema("car1", "car2", "car3")), cars(cars_schema("car1", "car2", "car3")))

val rdd = spark.sparkContext.parallelize(seq)

The schema here is:

root
 |-- cars: struct (nullable = true)
 |    |-- car1: string (nullable = true)
 |    |-- car2: string (nullable = true)
 |    |-- car3: string (nullable = true)

Then, if you try to call the aggregation:

val df = seq.toDF
df.agg(agg0(col("cars")))

you have to change the input schema of your UDAF, for example:

val carsSchema =
    StructType(List(StructField("car1", StringType, true), StructField("car2", StringType, true), StructField("car3", StringType, true)))

and in the body of your UDAF you have to use this schema as the `inputSchema`:

override def inputSchema: StructType = StructType(StructField("input", carsSchema) :: Nil)

In the update method you have to handle the format of the input row:

override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
  val i = input.getStruct(0)
  // i here is a Row [car1,car2,car3]; read its fields with
  // i.getAs[String]("car1"), i.getAs[String]("car2"), etc.
  buffer(0) = ???
}

From there you can transform `i` to update the buffer, and implement the merge and evaluate functions.
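Putting the pieces together, a minimal sketch of such a UDAF is shown below. The aggregation logic here (counting rows whose `car1` field is non-null) is an assumption for illustration, not from the original answer; only the `inputSchema` wiring is taken from the text above.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical UDAF: counts rows whose cars.car1 field is non-null.
object agg0 extends UserDefinedAggregateFunction {
  private val carsSchema = StructType(List(
    StructField("car1", StringType, true),
    StructField("car2", StringType, true),
    StructField("car3", StringType, true)))

  // The single input column is the whole cars struct
  override def inputSchema: StructType =
    StructType(StructField("input", carsSchema) :: Nil)

  // Intermediate state: a single long counter
  override def bufferSchema: StructType =
    StructType(StructField("count", LongType) :: Nil)

  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = 0L

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val cars = input.getStruct(0) // a Row with fields car1, car2, car3
    if (cars != null && cars.getAs[String]("car1") != null)
      buffer(0) = buffer.getLong(0) + 1L
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)

  override def evaluate(buffer: Row): Any = buffer.getLong(0)
}
```

Invoked as `df.agg(agg0(col("cars")))`, this aggregates over the struct column directly. Note that `UserDefinedAggregateFunction` is deprecated since Spark 3.0 in favor of `Aggregator` registered via `functions.udaf`.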