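I need to do stateful processing on DataFrame rows. To do that, I have to create a bean or case class that models the data the stateful processing needs. After the stateful step I want to keep using the other data in the DataFrame without having to model it in the case class. How can I do that?
With stateless processing we can use UDFs and stay in DataFrame land, but that option isn't available here.
Here is what I tried: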
package com.example.so

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class WibbleState() // just a placeholder

case class Wibble
(
  x: String,
  y: Int,
  data: Row // data I don't want to model in the case class
)

object PartialModelization {

  def wibbleStateFlatMapper(k: String,
                            it: Iterator[Wibble],
                            state: GroupState[WibbleState]): Iterator[Wibble] = it

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartialModelization")
      .master("local[*]").getOrCreate()
    import spark.implicits._

    // imagine this is actually a streaming data frame
    val input = spark.createDataFrame(List(("a", 1, 0), ("b", 1, 2)))
      .toDF("x", "y", "z")

    // don't want to model z in the case class
    // if that seems pointless imagine there is also z1, z2, z3, etc
    // or that z is itself a struct
    input.select($"x", $"y", struct("*").as("data"))
      .as[Wibble]
      .groupByKey(w => w.x)
      .flatMapGroupsWithState[WibbleState, Wibble](
        OutputMode.Append, GroupStateTimeout.NoTimeout)(wibbleStateFlatMapper)
      .select("data.*")
      .show()
  }
}
This fails with the following error:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "data")
- root class: "com.example.so.Wibble"
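Conceptually, you might suggest finding some key that lets us join the output DataFrame back to the input to recover the "data" attribute, but that sounds like a terrible solution in terms of both performance and implementation complexity. (In that case I'd rather just model the whole data structure in the case class!)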
Answer 0 (score: 0)
The best solution I've found so far is to use a tuple to keep the mapper's data separate from the row data.
So we remove the data attribute from Wibble.
case class Wibble
(
  x: String,
  y: Int
)
Change the types on the stateful flat mapper so that it handles (Wibble, Row) instead of just Wibble:
def wibbleStateFlatMapper(k: String,
                          it: Iterator[(Wibble, Row)],
                          state: GroupState[WibbleState]): Iterator[(Wibble, Row)] = it
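For illustration, here is a minimal sketch of what such a mapper might look like once it actually touches state. `SeenState`, `PassThroughMapper` and `advance` are hypothetical names that are not in the code above; the state here only counts how many rows each key has produced so far. The point is that only the `Wibble` half of the tuple is something the mapper needs to understand, while the `Row` half is forwarded as-is.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.streaming.GroupState

case class Wibble(x: String, y: Int)

// Hypothetical state with one field; the WibbleState above is an empty placeholder.
case class SeenState(seen: Long)

object PassThroughMapper {

  // Pure state transition: add the size of the current group to the running total.
  def advance(previous: Option[SeenState], batchSize: Long): SeenState =
    SeenState(previous.map(_.seen).getOrElse(0L) + batchSize)

  def mapper(key: String,
             it: Iterator[(Wibble, Row)],
             state: GroupState[SeenState]): Iterator[(Wibble, Row)] = {
    // Materialise the group so it can be counted and then re-emitted.
    val rows = it.toVector
    state.update(advance(state.getOption, rows.size.toLong))
    // The Row half is never inspected, only passed along with its Wibble.
    rows.iterator
  }
}
```

Keeping the state transition in a plain function like `advance` is just one way to make the logic easy to check without a running Spark session.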
Now our pipeline code becomes:
// additional imports needed by this version
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// imagine this is actually a streaming data frame
val input = spark.createDataFrame(List(("a", 1, 0), ("b", 1, 2)))
  .toDF("x", "y", "z")

// encoder for the untouched row, built from the input schema
val inputEncoder = RowEncoder(input.schema)
val wibbleEncoder = Encoders.product[Wibble]
implicit val tupleEncoder: Encoder[(Wibble, Row)] =
  Encoders.tuple(wibbleEncoder, inputEncoder)

input.select(struct($"x", $"y").as("wibble"), struct("*").as("data"))
  .as(tupleEncoder)
  .groupByKey { case (w, _) => w.x }
  .flatMapGroupsWithState(
    OutputMode.Append, GroupStateTimeout.NoTimeout)(wibbleStateFlatMapper)
  .select("_2.*")
  .show()
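One thing to keep in mind with the pipeline above: `.select("_2.*")` reads every column back from the untouched `Row`, so any change the mapper makes to `Wibble` will not show up in the result. Below is a sketch, assuming the mapper updates `y`, of how the modeled fields in `_1` could be recombined with the unmodeled column `z` from `_2`. `SeenState`, `bump` and `TupleColumnsSketch` are hypothetical names, and `RowEncoder(schema)` is used exactly as in the answer, so this assumes a Spark version where that call is available.

```scala
import org.apache.spark.sql.{Encoder, Encoders, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Wibble(x: String, y: Int)

// Hypothetical state type; it is not used by the mapper in this sketch.
case class SeenState(seen: Long)

object TupleColumnsSketch {

  // Hypothetical change to the modeled data, so the effect is visible downstream.
  def bump(w: Wibble): Wibble = w.copy(y = w.y + 1)

  def mapper(k: String,
             it: Iterator[(Wibble, Row)],
             state: GroupState[SeenState]): Iterator[(Wibble, Row)] =
    it.map { case (w, row) => (bump(w), row) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TupleColumnsSketch")
      .master("local[*]").getOrCreate()
    import spark.implicits._

    val input = spark.createDataFrame(List(("a", 1, 0), ("b", 1, 2)))
      .toDF("x", "y", "z")

    val tupleEncoder: Encoder[(Wibble, Row)] =
      Encoders.tuple(Encoders.product[Wibble], RowEncoder(input.schema))

    input.select(struct($"x", $"y").as("wibble"), struct("*").as("data"))
      .as(tupleEncoder)
      .groupByKey(_._1.x)
      // Encoders are passed explicitly so nothing depends on implicit resolution.
      .flatMapGroupsWithState[SeenState, (Wibble, Row)](
        OutputMode.Append, GroupStateTimeout.NoTimeout)(mapper)(
        Encoders.product[SeenState], tupleEncoder)
      // _1 is the (possibly updated) Wibble, _2 is the original row.
      .select($"_1.x", $"_1.y", $"_2.z")
      .show()

    spark.stop()
  }
}
```

Selecting `$"_1.x", $"_1.y"` picks up the mapper's version of the modeled fields, while `$"_2.z"` stands in for whatever columns were never modeled.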