Calculate a value using several preceding rows

Date: 2017-12-04 23:03:17

Tags: apache-spark

I have a DataFrame containing events ordered by timestamp. Some events mark the beginning of a new epoch:

+------+-----------+
| Time | Type      |
+------+-----------+
| 0    | New Epoch |
| 2    | Foo       |
| 3    | Bar       |
| 11   | New Epoch |
| 12   | Baz       |
+------+-----------+

I want to add a column with the epoch number; for simplicity, it can be equal to the timestamp of the epoch's beginning:

+------+-----------+-------+
| Time | Type      | Epoch |
+------+-----------+-------+
| 0    | New Epoch | 0     |
| 2    | Foo       | 0     |
| 3    | Bar       | 0     |
| 11   | New Epoch | 11    |
| 12   | Baz       | 11    |
+------+-----------+-------+

How can I achieve this?

A naive algorithm would be to write a function that walks backwards from each row until it finds a row with $"Type" === "New Epoch" and takes its $"Time". If I knew the maximum number of events within one epoch, I could implement this by calling lag that many times, but I have no idea how to handle the general case.
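
For illustration, here is a minimal sketch of that bounded idea (my own addition, not from the question), assuming the number of events per epoch is at most some known N. It checks the current row and each of the N preceding rows, nearest first, and coalesce picks the closest epoch start:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark` is in scope (e.g. spark-shell)

val ds = List((0, "New Epoch"), (2, "Foo"), (3, "Bar"), (11, "New Epoch"), (12, "Baz")).toDF("Time", "Type")

val N = 3  // assumed upper bound on events per epoch (hypothetical)
val w = Window.orderBy("Time")
// Candidate epoch starts, from the current row outward to N rows back
val candidates = when($"Type" === "New Epoch", $"Time") +:
  (1 to N).map(i => when(lag($"Type", i).over(w) === "New Epoch", lag($"Time", i).over(w)))
// coalesce returns the first non-null candidate, i.e. the nearest preceding epoch start
ds.withColumn("Epoch", coalesce(candidates: _*)).show(false)

This breaks as soon as an epoch contains more than N events, which is exactly why a general solution is needed.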

1 answer:

Answer 0 (score: 2):

Below is my solution. In short, I build a DataFrame representing the epoch intervals and then join it with the original DataFrame.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark` is in scope (e.g. spark-shell)

// Sample data from the question
val ds = List((0, "New Epoch"), (2, "Foo"), (3, "Bar"), (11, "New Epoch"), (12, "Baz")).toDF("Time", "Type")

// Keep only the rows that start an epoch
val epoch = ds.filter($"Type" === "New Epoch")

// Pair each epoch start with the start of the next epoch
val spec = Window.orderBy("Time")
val epochInterval = epoch.withColumn("next_epoch", lead($"Time", 1).over(spec))

// Range join: an event belongs to the epoch whose interval contains its Time;
// the last epoch has next_epoch = null, so it is treated as open-ended
val result = ds.as("left").join(epochInterval.as("right"),
    $"left.Time" >= $"right.Time" && ($"left.Time" < $"right.next_epoch" || $"right.next_epoch".isNull))
  .select($"left.Time", $"left.Type", $"right.Time".as("Epoch"))

result.show(false)


+----+---------+-----+
|Time|Type     |Epoch|
+----+---------+-----+
|0   |New Epoch|0    |
|2   |Foo      |0    |
|3   |Bar      |0    |
|11  |New Epoch|11   |
|12  |Baz      |11   |
+----+---------+-----+
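
A note on the design choice: the join works, but Window.orderBy without partitionBy moves all rows into a single partition (Spark logs a warning about this). An alternative sketch, not part of the original answer, gets the same result with a single running window that carries the most recent epoch start forward:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Running window from the first row up to the current row, ordered by Time
val runningSpec = Window.orderBy("Time").rowsBetween(Window.unboundedPreceding, Window.currentRow)
// max ignores nulls, so this yields the latest "New Epoch" time seen so far
val result2 = ds.withColumn("Epoch",
  max(when($"Type" === "New Epoch", $"Time")).over(runningSpec))
result2.show(false)

This avoids the join entirely, though it still requires a global ordering over Time.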