I have a large dataframe with 222 columns that I want to fill, like the example below:
|id |day |col1 |col2 | col3 ....................
+----------+----------------+-------+-----+
| 329| 0| null|2.0
| 329| 42| null|null
| 329| 72| 5.55|null
| 329| 106| null|null
| 329| 135| null|3.0
| 329| 168| null|4.0
| 329| 189| 4.995|null
| 329| 212| null|6.0
| 329| 247| null|null
| 329| 274| null|8.0
The result I want (nulls filled forward with the previous non-null value for the same id):

|id | day |col1 |col2 |.......................
+----------+----------------+-------+-----+
| 329| 0| null|2.0
| 329| 42| null|2.0
| 329| 72| 5.55|2.0
| 329| 106| 5.55|2.0
| 329| 135| 5.55|3.0
| 329| 168| 5.55|4.0
| 329| 189| 4.995|4.0
| 329| 212| 4.995|6.0
| 329| 247| 4.995|6.0
| 329| 274| 4.995|8.0
...
1. Read row 1.
2. I have 85k unique ids, with 10 rows per id (the example shows only one id).
3. If a row has no data, take the value from the previous row of the same id.
I got a result like this:
id | day |original_col1 |Result_col1|prevValue|
+----------+----------------+--------------+-----------+---------+
| 329| 0| null | null | null|
| 329| 42| null | null | null|
| 329| 72| 5.55 | 5.55 | null|
| 329| 106| null | 5.55 | 5.55|
| 329| 135| null | null | null|
| 329| 168| null | null | null|
| 329| 189| 4.995 | 4.995 | null|
| 329| 212| null | 4.995 | 4.995|
| 330|....................................................
| 330|.....................................................
.
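For reference, the partially-filled result above is what a single-step lag gives. A minimal sketch of such an attempt (my reconstruction, assuming the question's dataframe is called `df`; not quoted from the question) would be:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, when}

// lag(_, 1) only looks one row back, so a run of consecutive nulls is only
// partially filled, which matches the Result_col1 column shown above.
val w = Window.partitionBy("id").orderBy("day")
val attempt = df
  .withColumn("prevValue", lag(col("col1"), 1).over(w))
  .withColumn("Result_col1", when(col("col1").isNull, col("prevValue")).otherwise(col("col1")))
```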
Answer 0 (score: 3)
This can't be achieved with the existing window functions (such as lag). You need to partition and sort with the same idea in mind, but use custom logic to roll the non-null values forward.
case class MyRec(id: Integer, day: Integer, col1: Option[Double], col2: Option[Double])
defined class MyRec
scala> :paste
// Entering paste mode (ctrl-D to finish)
val ds = Seq(
  MyRec(329, 0, None, Some(2.0)),
  MyRec(329, 42, None, None),
  MyRec(329, 72, Some(5.55), None),
  MyRec(329, 106, None, None),
  MyRec(329, 135, None, Some(3.0)),
  MyRec(329, 168, None, Some(4.0)),
  MyRec(329, 189, Some(4.995), None),
  MyRec(329, 212, None, Some(6.0)),
  MyRec(329, 247, None, None),
  MyRec(329, 274, None, Some(8.0))
).toDS()

ds.printSchema()
ds.show(false)

// All rows of an id end up in the same partition, sorted by day, so the
// mutable map can carry the last non-null value forward within each id.
val updated_ds = ds.repartition('id).sortWithinPartitions('id, 'day)
  .mapPartitions(iter => {
    var crtId: Integer = null
    var prevId: Integer = null
    var rollingVals = collection.mutable.Map[String, Option[Double]]()
    for (rec <- iter) yield {
      crtId = rec.id
      // 1st record for a new id: reset the rolling values
      if (prevId == null || crtId != prevId) {
        rollingVals = collection.mutable.Map[String, Option[Double]]()
        prevId = crtId
      }
      // keep the current value if present, otherwise the last one seen for this id
      rollingVals("col1") = if (rec.col1.isDefined) rec.col1 else rollingVals.getOrElse("col1", None)
      rollingVals("col2") = if (rec.col2.isDefined) rec.col2 else rollingVals.getOrElse("col2", None)
      MyRec(rec.id, rec.day, rollingVals("col1"), rollingVals("col2"))
    }
  })
updated_ds.printSchema()
updated_ds.show(false)
// Exiting paste mode, now interpreting.
root
|-- id: integer (nullable = true)
|-- day: integer (nullable = true)
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
+---+---+-----+----+
|id |day|col1 |col2|
+---+---+-----+----+
|329|0 |null |2.0 |
|329|42 |null |null|
|329|72 |5.55 |null|
|329|106|null |null|
|329|135|null |3.0 |
|329|168|null |4.0 |
|329|189|4.995|null|
|329|212|null |6.0 |
|329|247|null |null|
|329|274|null |8.0 |
+---+---+-----+----+
root
|-- id: integer (nullable = true)
|-- day: integer (nullable = true)
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
+---+---+-----+----+
|id |day|col1 |col2|
+---+---+-----+----+
|329|0 |null |2.0 |
|329|42 |null |2.0 |
|329|72 |5.55 |2.0 |
|329|106|5.55 |2.0 |
|329|135|5.55 |3.0 |
|329|168|5.55 |4.0 |
|329|189|4.995|4.0 |
|329|212|4.995|6.0 |
|329|247|4.995|6.0 |
|329|274|4.995|8.0 |
+---+---+-----+----+
ds: org.apache.spark.sql.Dataset[MyRec] = [id: int, day: int ... 2 more fields]
updated_ds: org.apache.spark.sql.Dataset[MyRec] = [id: int, day: int ... 2 more fields]
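Since the real frame has 222 columns, spelling out a case class field per column is impractical. A sketch of the same mapPartitions idea applied to generic Rows (my own assumption about the layout: id first, day second, everything after that gets filled; this is not part of the original answer) could look like this:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.functions.col

// Sketch only: forward-fill every column after id and day, per id, ordered by day.
val schema   = ds.schema
val fillCols = ds.columns.drop(2).toSet          // assumes columns 1 and 2 are id, day
implicit val rowEnc = RowEncoder(schema)

val filledAll = ds.toDF()
  .repartition(col("id")).sortWithinPartitions(col("id"), col("day"))
  .mapPartitions { iter =>
    var prevId: Any = null
    val rolling = collection.mutable.Map[String, Any]()
    iter.map { row =>
      val id = row.get(0)
      if (prevId == null || id != prevId) { rolling.clear(); prevId = id } // new id: reset
      val values = schema.fieldNames.map { name =>
        if (fillCols.contains(name)) {
          val v = row.getAs[Any](name)
          if (v != null) { rolling(name) = v; v } else rolling.getOrElse(name, null)
        } else row.getAs[Any](name)
      }
      Row(values: _*)
    }
  }
```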
Answer 1 (score: 1)
Use window functions, then a case-when:
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}

val df2 = df
  .withColumn("prevValue", lag('col1, 1).over(Window.partitionBy('id).orderBy('day)))
  .withColumn("col1", when('col1.isNull, 'prevValue).otherwise('col1))
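To apply the same lag/when pattern to many columns, one option (a sketch, assuming the list of column names to fill is known; `colsToFill` below is a placeholder) is to fold over the names:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, when}

val w = Window.partitionBy("id").orderBy("day")
val colsToFill = Seq("col1", "col2") // placeholder: the real frame has 222 such columns

// Note: lag(_, 1) only reaches one row back, so consecutive nulls stay only
// partially filled; the mapPartitions approach above rolls values all the way forward.
val dfFilled = colsToFill.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, when(col(c).isNull, lag(col(c), 1).over(w)).otherwise(col(c)))
}
```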