I have a problem with time series data. Because of power failures, some timestamps are missing from the dataset. I need to fill these gaps by adding rows, and after that I can interpolate the missing values.
Input data:
periodstart usage
---------------------------------
2015-09-11 02:15 23000
2015-09-11 03:15 23344
2015-09-11 03:30 23283
2015-09-11 03:45 23786
2015-09-11 04:00 25039
Desired output:
periodstart usage
---------------------------------
2015-09-11 02:15 23000
2015-09-11 02:30 0
2015-09-11 02:45 0
2015-09-11 03:00 0
2015-09-11 03:15 23344
2015-09-11 03:30 23283
2015-09-11 03:45 23786
2015-09-11 04:00 25039
For now I have fixed this with a while loop inside a dataset foreach function. The problem is that I first have to collect the dataset to the driver before I can run the while loop, so this is not the right approach for Spark.
Can someone give me a better solution?
This is my code:
MissingMeasurementsDS.collect().foreach(row => {
  // Empty list for newly generated measurements
  val output = ListBuffer.empty[Measurement]
  // Number of missing measurements
  val missingMeasurements = row.getAs[Int]("missingmeasurements")
  val lastTimestamp = row.getAs[Timestamp]("previousperiodstart")
  // Generate missing timestamps
  var i = 1
  while (i <= missingMeasurements) {
    // Increment timestamp by 15 minutes (900000 milliseconds)
    val newTimestamp = lastTimestamp.getTime + (900000 * i)
    output += Measurement(new Timestamp(newTimestamp), 0)
    i += 1
  }
  // Join interpolated measurements with correct measurements
  completeMeasurementsDS.join(output.toDS())
})
completeMeasurementsDS.show()
println("OutputDF count = " + completeMeasurementsDS.count())
Answer (score: 11):
If the input DataFrame has the following structure:
root
|-- periodstart: timestamp (nullable = true)
|-- usage: long (nullable = true)
Scala
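For reference, a small DataFrame with this schema can be built from the sample rows in the question. This is only a sketch; the string-to-timestamp cast and the SparkSession value spark are assumptions, not part of the original answer:
import spark.implicits._

// Sample input from the question (seconds added so the strings cast cleanly)
val df = Seq(
  ("2015-09-11 02:15:00", 23000L),
  ("2015-09-11 03:15:00", 23344L),
  ("2015-09-11 03:30:00", 23283L),
  ("2015-09-11 03:45:00", 23786L),
  ("2015-09-11 04:00:00", 25039L)
).toDF("periodstart", "usage")
  .withColumn("periodstart", $"periodstart".cast("timestamp"))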
Determine the min / max:
val (minp, maxp) = df
.select(min($"periodstart").cast("bigint"), max($"periodstart".cast("bigint")))
.as[(Long, Long)]
.first
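Unlike collecting the whole dataset, this only brings the two boundary values back to the driver, which is cheap.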
Set a step, for example 15 minutes:
val step: Long = 15 * 60
Generate a reference range:
val reference = spark
.range((minp / step) * step, ((maxp / step) + 1) * step, step)
.select($"id".cast("timestamp").alias("periodstart"))
Join and fill the gaps:
reference.join(df, Seq("periodstart"), "leftouter").na.fill(0, Seq("usage"))
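For completeness, the individual steps can be wrapped into a single helper. This is only a sketch of the same approach; the name fillGaps and the 15-minute default step are my own additions:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{min, max}

// Hypothetical helper: add one row per missing 15-minute period with usage = 0
def fillGaps(df: DataFrame, step: Long = 15 * 60L): DataFrame = {
  val spark = df.sparkSession
  import spark.implicits._

  // Min and max period as epoch seconds (only two values go to the driver)
  val (minp, maxp) = df
    .select(min($"periodstart").cast("bigint"), max($"periodstart").cast("bigint"))
    .as[(Long, Long)]
    .first

  // One reference row per step between the aligned min and max
  val reference = spark
    .range((minp / step) * step, ((maxp / step) + 1) * step, step)
    .select($"id".cast("timestamp").alias("periodstart"))

  // Keep every reference period; missing usage becomes 0
  reference.join(df, Seq("periodstart"), "leftouter").na.fill(0, Seq("usage"))
}

With the sample data above, fillGaps(df).orderBy("periodstart").show() should then reproduce the desired output from the question.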
Python
The same in PySpark:
from pyspark.sql.functions import col, min as min_, max as max_
step = 15 * 60
minp, maxp = df.select(
    min_("periodstart").cast("long"), max_("periodstart").cast("long")
).first()
reference = spark.range(
    (minp // step) * step, ((maxp // step) + 1) * step, step
).select(col("id").cast("timestamp").alias("periodstart"))
reference.join(df, ["periodstart"], "leftouter").na.fill(0, ["usage"])
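After the join and fill, every expected 15-minute period is present, with usage 0 where data was missing; the actual interpolation of those values can then be done as a separate step, as planned in the question.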