Given a Dataset of S, the goal is to produce a Dataset of E, where S and E are defined as follows:
// event where start (s) is inclusive, end (e) is exclusive
case class E(id: Int, state: String, s: Int, e: Option[Int])
// snapshot of the state at time t for an id
case class S(id: Int, state: String, time: Int)
// Example test case
val ss: Dataset[S] = Seq(
  S(100, "a", 1), S(100, "b", 2), S(100, "b", 3), S(100, "a", 4),
  S(100, "a", 5), S(100, "a", 6), S(100, "c", 9)
).toDS

val es: Dataset[E] = ss.toEs
es.collect() must contain theSameElementsAs
  Seq(E(100, "a", 1, Some(2)), E(100, "b", 2, Some(4)), E(100, "a", 4, Some(9)), E(100, "c", 9, None))
A state can appear in multiple snapshots (at different times), but the output should coalesce each run into a single event with the effective start and end times. Likewise, the last active state should have no end time (`None`) in the output.
The toEs above is defined as follows:
implicit class SOps(ss: Dataset[S]) {
  def toEs(implicit spark: SparkSession): Dataset[E] = ???
}
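Before involving Spark at all, the coalescing rule itself can be sketched as a pure function over an in-memory, time-sorted sequence. This is an illustrative sketch only (the `coalesce` helper is not part of the question), but it pins down the semantics the test case expects:

```scala
// Illustrative only: the run-coalescing rule as a pure function on an
// already time-sorted sequence, independent of Spark.
case class S(id: Int, state: String, time: Int)
case class E(id: Int, state: String, s: Int, e: Option[Int])

def coalesce(snapshots: Seq[S]): Seq[E] = {
  // Keep the first snapshot of every run of equal consecutive states.
  val starts = snapshots.headOption.toList ++
    snapshots.sliding(2).collect { case Seq(a, b) if a.state != b.state => b }
  // Each run ends where the next run starts; the final run stays open.
  val ends = starts.drop(1).map(n => Option(n.time)) :+ None
  starts.zip(ends).map { case (st, end) => E(st.id, st.state, st.time, end) }
}
```

On the example input this yields exactly the four expected events, and it handles the single-snapshot and empty cases as well.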
The diagram below illustrates the desired transformation.
Answer (score: 0):
Here is a solution using flatMapGroups, which spills to disk if a group is too large to fit in memory:
def toEs(implicit spark: SparkSession): Dataset[E] = {
  import spark.implicits._
  ss
    .sort(ss("id"), ss("time"))
    .groupByKey(s => s.id)
    .flatMapGroups { (_, snapshots) =>
      // Assumes each group's iterator arrives sorted by time. Spark does not
      // contractually guarantee that the pre-groupByKey sort survives the
      // shuffle, so verify this for your Spark version or sort the group
      // explicitly.
      new Iterator[E] {
        // Holds the snapshot that starts the next run of equal states.
        var nextStart: Option[S] = None
        override def hasNext: Boolean = snapshots.hasNext || nextStart.isDefined
        override def next(): E = {
          val start = nextStart.getOrElse(snapshots.next())
          nextStart = None
          // Advance past every snapshot that repeats the current state.
          var last = start
          while (snapshots.hasNext && last.state == start.state)
            last = snapshots.next()
          if (last.state == start.state) {
            // Iterator exhausted without a state change: last active state.
            E(start.id, start.state, start.time, None)
          } else {
            // `last` begins the next run; its time closes the current event.
            nextStart = Some(last)
            E(start.id, start.state, start.time, Some(last.time))
          }
        }
      }
    }
}
It looks very imperative, though, so I'm not super happy with it :(
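For a more declarative take, the same run-coalescing can be expressed with window functions: lag marks the first row of each run, lead closes it with the next run's start. This is a sketch, not tested against the question's exact setup; `toEsWindowed` is an illustrative name, and the column handling assumes the S/E definitions above (a null `lead` result decodes to `None` via the `Option[Int]` field in E):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lead}

def toEsWindowed(ss: Dataset[S])(implicit spark: SparkSession): Dataset[E] = {
  import spark.implicits._
  val byId = Window.partitionBy("id").orderBy("time")
  ss.withColumn("prevState", lag("state", 1).over(byId))
    // Keep only the first snapshot of each run of equal states.
    .where($"prevState".isNull || $"prevState" =!= $"state")
    // Among the surviving rows, the next run's start closes this event.
    .withColumn("e", lead("time", 1).over(byId))
    .select($"id", $"state", $"time".as("s"), $"e")
    .as[E]
}
```

Note that, like groupByKey, a window partitioned by id pulls each id's rows together, so this does not change the fundamental memory profile; it only trades the hand-written iterator for SQL semantics.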