I have a DataFrame that contains events ordered by timestamp.
Certain events mark the beginning of a new epoch:
+------+-----------+
| Time | Type      |
+------+-----------+
| 0    | New Epoch |
| 2    | Foo       |
| 3    | Bar       |
| 11   | New Epoch |
| 12   | Baz       |
+------+-----------+
I would like to add a column with the epoch number. For simplicity, it can be equal to the timestamp at which the epoch begins:
+------+-----------+-------+
| Time | Type      | Epoch |
+------+-----------+-------+
| 0    | New Epoch | 0     |
| 2    | Foo       | 0     |
| 3    | Bar       | 0     |
| 11   | New Epoch | 11    |
| 12   | Baz       | 11    |
+------+-----------+-------+
How can I achieve this?
The naive algorithm would be to write a function that walks backwards until it finds a row with $"Type" === "New Epoch" and takes its $"Time". If I knew the maximum number of events within an epoch, I could probably implement it by repeating a fixed-offset lookback that many times. But I have no idea how to handle the general case.
Answer 0 (score: 2)
Here is my solution. In short, I build a DataFrame that represents the epoch intervals and then join it with the original DataFrame.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// In spark-shell, spark.implicits._ (needed for toDF and $"...") is already imported.

val ds = List((0, "New Epoch"), (2, "Foo"), (3, "Bar"), (11, "New Epoch"), (12, "Baz")).toDF("Time", "Type")

// Keep only the rows that start an epoch.
val epoch = ds.filter($"Type" === "New Epoch")

// For each epoch start, look up the start of the following epoch (null for the last one).
val spec = Window.orderBy("Time")
val epochInterval = epoch.withColumn("next_epoch", lead($"Time", 1).over(spec))

// Match each event to the epoch whose interval [Time, next_epoch) contains it.
val result = ds.as("left").join(epochInterval.as("right"), $"left.Time" >= $"right.Time" && ($"left.Time" < $"right.next_epoch" || $"right.next_epoch".isNull))
  .select($"left.Time", $"left.Type", $"right.Time".as("Epoch"))

result.show(false)
+----+---------+-----+
|Time|Type     |Epoch|
+----+---------+-----+
|0   |New Epoch|0    |
|2   |Foo      |0    |
|3   |Bar      |0    |
|11  |New Epoch|11   |
|12  |Baz      |11   |
+----+---------+-----+
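As a possible alternative to the interval join, the same Epoch column can probably be computed with a single window function: carry forward the most recent epoch start by taking the last non-null value over all rows up to the current one. The sketch below is not part of the original answer. It assumes Spark is on the classpath, that timestamps are unique (ties would be ordered arbitrarily), and that a local SparkSession is acceptable. The names events, upToCurrentRow, epochStart and withEpoch are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{last, when}

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("epoch-forward-fill").getOrCreate()
import spark.implicits._

val events = List((0, "New Epoch"), (2, "Foo"), (3, "Bar"), (11, "New Epoch"), (12, "Baz")).toDF("Time", "Type")

// Frame covering every row from the start of the data up to and including the current row.
val upToCurrentRow = Window.orderBy("Time").rowsBetween(Window.unboundedPreceding, Window.currentRow)

// The timestamp on rows that start an epoch, null on every other row.
val epochStart = when($"Type" === "New Epoch", $"Time")

// The last non-null epoch start seen so far is the epoch of the current row.
val withEpoch = events.withColumn("Epoch", last(epochStart, ignoreNulls = true).over(upToCurrentRow))

withEpoch.orderBy("Time").show(false)
```

Events that occur before the first "New Epoch" row would get a null Epoch here, whereas the inner join above would drop them. Like the window in the answer, this one has no partitionBy, so Spark moves all rows into a single partition to evaluate it. That may be a concern for large inputs.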