有了火花,我定义了一个窗口:
val window = Window
.partitionBy("myaggcol")
.orderBy("datefield")
.rowsBetween(-2, 0)
然后我可以从窗口的行中计算出新列,例如:
dataset
.withColumn("newcol", last("diffcol").over(window) - first("diffcol").over(window))
这将为每个点计算n-2行中“ diffcol”的差。
现在我的问题是:如何获得n-1行的“ diffcol”,而不是第一行或最后一行,而是中介行?
答案 0 :(得分:1)
如果我正确理解了您的问题,则窗口函数lag
会比rowsBetween
更好,如下例所示:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = Seq(
("a", 1, 100), ("a", 2, 200), ("a", 3, 300), ("a", 4, 400),
("b", 1, 500), ("b", 2, 600), ("b", 3, 700)
).toDF("c1", "c2", "c3")
val win = Window.partitionBy("c1").orderBy("c2")
df.
withColumn("c3Diff1", $"c3" - coalesce(lag("c3", 1).over(win), lit(0))).
withColumn("c3Diff2", $"c3" - coalesce(lag("c3", 2).over(win), lit(0))).
show
// +---+---+---+-------+-------+
// | c1| c2| c3|c3Diff1|c3Diff2|
// +---+---+---+-------+-------+
// | b| 1|500| 500| 500|
// | b| 2|600| 100| 600|
// | b| 3|700| 100| 200|
// | a| 1|100| 100| 100|
// | a| 2|200| 100| 200|
// | a| 3|300| 100| 200|
// | a| 4|400| 100| 200|
// +---+---+---+-------+-------+