I have data in the following format:
+---------------------+----+----+---------+----------+
| date_time           | id | cm | p_count | bcm      |
+---------------------+----+----+---------+----------+
| 2018-02-01 04:38:00 | v1 | c1 | 1       | null     |
| 2018-02-01 05:37:07 | v1 | c1 | 1       | null     |
| 2018-02-01 11:19:38 | v1 | c1 | 1       | null     |
| 2018-02-01 12:09:19 | v1 | c1 | 1       | c1       |
| 2018-02-01 14:05:10 | v2 | c2 | 1       | c2       |
+---------------------+----+----+---------+----------+
I need a rolling sum of the p_count column between two date_times, partitioned by id.
The logic for the start and end of the window is as follows:
start_date_time=min(date_time) group by (id,cm)
end_date_time= bcm == cm ? date_time : null
In this case, start_date_time = 2018-02-01 04:38:00 and end_date_time = 2018-02-01 12:09:19.
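Expressed with Spark window functions, that boundary logic could be sketched as follows (a sketch only, assuming the data above is loaded in a DataFrame named df; the names df, byGroup and withBounds are placeholders, not from the original):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// start_date_time = min(date_time) per (id, cm) group
val byGroup = Window.partitionBy("id", "cm")
val withBounds = df
  .withColumn("start_date_time", min("date_time").over(byGroup))
  // end_date_time = bcm == cm ? date_time : null
  // (when() without otherwise() yields null for non-matching rows)
  .withColumn("end_date_time", when(col("bcm") === col("cm"), col("date_time")))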
The output should look like this:
+---------------------+----+----+---------+----------+-------------+
| date_time           | id | cm | p_count | bcm      | p_sum_count |
+---------------------+----+----+---------+----------+-------------+
| 2018-02-01 04:38:00 | v1 | c1 | 1       | null     | 1           |
| 2018-02-01 05:37:07 | v1 | c1 | 1       | null     | 2           |
| 2018-02-01 11:19:38 | v1 | c1 | 1       | null     | 3           |
| 2018-02-01 12:09:19 | v1 | c1 | 1       | c1       | 4           |
| 2018-02-01 14:05:10 | v2 | c2 | 1       | c2       | 1           |
+---------------------+----+----+---------+----------+-------------+
Answer 0 (score: 0):
// Build the sample DataFrame; the bcm nulls unify with the String entries
val input = sqlContext.createDataFrame(Seq(
  ("2018-02-01 04:38:00", "v1", "c1", 1, null),
  ("2018-02-01 05:37:07", "v1", "c1", 1, null),
  ("2018-02-01 11:19:38", "v1", "c1", 1, null),
  ("2018-02-01 12:09:19", "v1", "c1", 1, "c1"),
  ("2018-02-01 14:05:10", "v2", "c2", 1, "c2")
)).toDF("date_time", "id", "cm", "p_count", "bcm")
input.show()
Result:
+-------------------+---+---+-------+----+
|          date_time| id| cm|p_count| bcm|
+-------------------+---+---+-------+----+
|2018-02-01 04:38:00| v1| c1|      1|null|
|2018-02-01 05:37:07| v1| c1|      1|null|
|2018-02-01 11:19:38| v1| c1|      1|null|
|2018-02-01 12:09:19| v1| c1|      1|  c1|
|2018-02-01 14:05:10| v2| c2|      1|  c2|
+-------------------+---+---+-------+----+
Next, the code:
input.createOrReplaceTempView("input_Table")

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// A plain aggregate would collapse the detail rows:
// val results = sqlContext.sql("SELECT sum(p_count) FROM input_Table tbl GROUP BY tbl.cm")
// A window function keeps every row and adds a running total (no partitioning yet):
val results = sqlContext.sql(
  "select *, " +
  "SUM(p_count) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
  "from input_Table")
results.show()
Result:
+-------------------+---+---+-------+----+--------------+
| date_time| id| cm|p_count| bcm|cumulative_Sum|
+-------------------+---+---+-------+----+--------------+
|2018-02-01 04:38:00| v1| c1| 1|null| 1|
|2018-02-01 05:37:07| v1| c1| 1|null| 2|
|2018-02-01 11:19:38| v1| c1| 1|null| 3|
|2018-02-01 12:09:19| v1| c1| 1| c1| 4|
|2018-02-01 14:05:10| v2| c2| 1| c2| 5|
+-------------------+---+---+-------+----+--------------+
To get the expected result, you need to partition the window and add the boundary logic from the question on top of it, as sketched below.
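A minimal sketch of that partitioned version (partitioning by id and cm and ordering by date_time is my reading of the expected output; the answer does not spell it out):

val partitioned = sqlContext.sql(
  "select *, " +
  "SUM(p_count) over ( partition by id, cm order by date_time " +
  "rows between unbounded preceding and current row ) p_sum_count " +
  "from input_Table")
partitioned.show()

For this sample every row already lies between start_date_time and end_date_time, so the partitioned running total alone reproduces the expected p_sum_count (1 through 4 for v1/c1, and 1 for v2/c2); rows falling outside those boundaries would additionally need a filter or a when() guard.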
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Logically, a windowed aggregate function is recalculated for each row in the PARTITION, based on all ROWS between a starting row and an ending row.
The starting and ending rows can be fixed or relative to the current row, based on the following keywords:

CURRENT ROW: the current row
UNBOUNDED PRECEDING: all rows before the current row (fixed)
UNBOUNDED FOLLOWING: all rows after the current row (fixed)
x PRECEDING: x rows before the current row (relative)
y FOLLOWING: y rows after the current row (relative)
Possible kinds of calculation include (see the sketch after this list):
Both the starting and the ending row are fixed: the window consists of all rows of the partition, e.g. a group sum, i.e. aggregate plus detail rows.
One end is fixed, the other is relative to the current row: the number of rows grows or shrinks, e.g. a running total or remaining sum.
Both the starting and the ending row are relative to the current row: the number of rows in the window is fixed, e.g. a moving average over n rows.
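The same three shapes can be sketched with Spark's Window API (a sketch against the sample columns; the val names are made up):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byId = Window.partitionBy("id").orderBy("date_time")

// fixed + fixed: the whole partition -> a group sum repeated on every detail row
val groupSum = sum("p_count").over(
  byId.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

// fixed + relative: the window grows row by row -> a running total
val runningTotal = sum("p_count").over(
  byId.rowsBetween(Window.unboundedPreceding, Window.currentRow))

// relative + relative: a constant-size window -> a moving average over 3 rows
val movingAvg = avg("p_count").over(
  byId.rowsBetween(-2, Window.currentRow))

(Window.unboundedPreceding, Window.unboundedFollowing and Window.currentRow require Spark 2.1+; on older versions the equivalent Long constants Long.MinValue, Long.MaxValue and 0L can be used.)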
So SUM(x) OVER (ORDER BY col ROWS UNBOUNDED PRECEDING) results in a cumulative sum, or running total:
11 -> 11
2 -> 11 + 2 = 13
3 -> 13 + 3 (or 11+2+3) = 16
44 -> 16 + 44 (or 11+2+3+44) = 60
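That arithmetic can be verified directly (nums, col and x are made-up names for this check):

val nums = sqlContext.createDataFrame(
  Seq((1, 11), (2, 2), (3, 3), (4, 44))
).toDF("col", "x")
nums.createOrReplaceTempView("nums")
// running total: 11, 13, 16, 60
sqlContext.sql(
  "select x, SUM(x) over ( order by col rows between unbounded preceding and current row ) running_total " +
  "from nums").show()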