How to pass data through a Spark mapper without modeling it in the case class?

Asked: 2018-12-24 17:27:02

Tags: scala apache-spark spark-structured-streaming

I need to do stateful processing on DataFrame rows. For that I have to create a bean or case class that models the data the stateful processing needs. I would like the rest of the data in the DataFrame to keep flowing through after the stateful step without having to model it in the case class as well. How can I do that?

In stateless processing we could stay in DataFrame land by using a UDF, but that option is not available here.
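For contrast, here is a minimal sketch of that stateless case (the plusOne UDF and the y2 column are made up for illustration, and input refers to the DataFrame defined in the code below): a UDF only touches the columns it is given, so everything else, including z, rides along untouched.

// Stateless case: a UDF transforms individual columns, so unmodeled
// columns such as z stay in the DataFrame automatically.
val plusOne = udf((y: Int) => y + 1)
input.withColumn("y2", plusOne($"y")).show() // x, y, z and y2 are all present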

Here is what I tried:

package com.example.so

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class WibbleState() // just a placeholder

case class Wibble
(
  x: String,
  y: Int,
  data: Row // data I don't want to model in the case class
)

object PartialModelization {

  def wibbleStateFlatMapper(k: String,
                            it: Iterator[Wibble],
                            state: GroupState[WibbleState]): Iterator[Wibble] = it

  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .appName("PartialModelization")
      .master("local[*]").getOrCreate()

    import spark.implicits._

    // imagine this is actually a streaming data frame
    val input = spark.createDataFrame(List(("a", 1, 0), ("b", 1, 2)))
      .toDF("x", "y", "z")
    // dont want to model z in the case class
    // if that seems pointless imagine there is also z1, z2, z3, etc
    // or that z is itself a struct

    input.select($"x", $"y", struct("*").as("data"))
      .as[Wibble]
      .groupByKey(w => w.x)
      .flatMapGroupsWithState[WibbleState, Wibble](
        OutputMode.Append, GroupStateTimeout.NoTimeout)(wibbleStateFlatMapper)
      .select("data.*")
      .show()

  }

}

Which fails with this error:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "data")
- root class: "com.example.so.Wibble"

Conceptually you might suggest finding some key that would let us join the output DataFrame back to the input to recover the "data" attribute, but that looks like a terrible solution from both a performance and an implementation-complexity point of view. (In that case I would rather just model the whole data structure in the case class!)

1 Answer:

Answer 0 (score: 0)

The best solution I have found so far is to use a tuple to keep the mapper's data separate from the row data.

So we remove the data attribute from Wibble:

case class Wibble
(
  x: String,
  y: Int
)

Change the types on the stateful flat mapper so it handles (Wibble, Row) instead of just Wibble:

def wibbleStateFlatMapper(k: String,
                          it: Iterator[(Wibble, Row)],
                          state: GroupState[WibbleState]): Iterator[(Wibble, Row)] = it

Now our pipeline code becomes:

// imagine this is actually a streaming data frame
val input = spark.createDataFrame(List(("a", 1, 0), ("b", 1, 2)))
  .toDF("x", "y", "z")

// Extra imports needed for the encoders below (on top of those in the question):
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.catalyst.encoders.RowEncoder

val inputEncoder = RowEncoder(input.schema)
val wibbleEncoder = Encoders.product[Wibble]
implicit val tupleEncoder = Encoders.tuple(wibbleEncoder, inputEncoder)

input.select(struct($"x", $"y").as("wibble"), struct("*").as("data"))
  .as(tupleEncoder)
  .groupByKey({case (w,_) => w.x})
  .flatMapGroupsWithState(
    OutputMode.Append, GroupStateTimeout.NoTimeout)(wibbleStateFlatMapper)
  .select("_2.*")
  .show()
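
Since wibbleStateFlatMapper here just passes its iterator straight through, the final select("_2.*") shows the original x, y and z columns unchanged; in a real job the mapper would update its GroupState and emit whatever (Wibble, Row) tuples it needs, with the unmodeled columns travelling in the Row half of the tuple.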