Running sum between two timestamps in PySpark

Time: 2018-06-28 06:01:43

Tags: apache-spark pyspark apache-spark-sql

I have data in the following format:

+---------------------+----+----+---------+----------+
|      date_time      | id | cm | p_count |   bcm    |
+---------------------+----+----+---------+----------+
| 2018-02-01 04:38:00 | v1 | c1 |       1 |  null    |
| 2018-02-01 05:37:07 | v1 | c1 |       1 |  null    |
| 2018-02-01 11:19:38 | v1 | c1 |       1 |  null    |
| 2018-02-01 12:09:19 | v1 | c1 |       1 |  c1      |
| 2018-02-01 14:05:10 | v2 | c2 |       1 |  c2      |
+---------------------+----+----+---------+----------+

I need to find the rolling sum of the p_count column between two date_time values, partitioned by id.

The logic for the start and end of the window is as follows:

start_date_time=min(date_time) group by (id,cm)

end_date_time= bcm == cm ? date_time : null

In this case, start_date_time = 2018-02-01 04:38:00 and end_date_time = 2018-02-01 12:09:19.

The output should look like this:

+---------------------+----+----+---------+----------+-------------+
|      date_time      | id | cm | p_count |   bcm    | p_sum_count |
+---------------------+----+----+---------+----------+-------------+
| 2018-02-01 04:38:00 | v1 | c1 |       1 |  null    |1            |
| 2018-02-01 05:37:07 | v1 | c1 |       1 |  null    |2            |
| 2018-02-01 11:19:38 | v1 | c1 |       1 |  null    |3            |
| 2018-02-01 12:09:19 | v1 | c1 |       1 |  c1      |4            |
| 2018-02-01 14:05:10 | v2 | c2 |       1 |  c2      |1            |
+---------------------+----+----+---------+----------+-------------+

1 Answer:

Answer 0 (score: 0):

val input = sqlContext.createDataFrame(Seq(
  ("2018-02-01 04:38:00", "v1", "c1", 1, null),
  ("2018-02-01 05:37:07", "v1", "c1", 1, null),
  ("2018-02-01 11:19:38", "v1", "c1", 1, null),
  ("2018-02-01 12:09:19", "v1", "c1", 1, "c1"),
  ("2018-02-01 14:05:10", "v2", "c2", 1, "c2")
)).toDF("date_time", "id", "cm", "p_count", "bcm")

input.show()

Result:

+-------------------+---+---+-------+----+
|          date_time| id| cm|p_count| bcm|
+-------------------+---+---+-------+----+
|2018-02-01 04:38:00| v1| c1|      1|null|
|2018-02-01 05:37:07| v1| c1|      1|null|
|2018-02-01 11:19:38| v1| c1|      1|null|
|2018-02-01 12:09:19| v1| c1|      1|  c1|
|2018-02-01 14:05:10| v2| c2|      1|  c2|
+-------------------+---+---+-------+----+

Next, the code:

input.createOrReplaceTempView("input_Table")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

//val results = spark.sqlContext.sql("SELECT sum(p_count) from input_Table tbl GROUP BY tbl.cm")

// Running total over the whole table, ordered by id: no partitioning yet,
// so the sum keeps growing across ids.
val results = sqlContext.sql("select *, " +
  "SUM(p_count) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
  "from input_Table ")
results.show()

Result:

+-------------------+---+---+-------+----+--------------+
|          date_time| id| cm|p_count| bcm|cumulative_Sum|
+-------------------+---+---+-------+----+--------------+
|2018-02-01 04:38:00| v1| c1|      1|null|             1|
|2018-02-01 05:37:07| v1| c1|      1|null|             2|
|2018-02-01 11:19:38| v1| c1|      1|null|             3|
|2018-02-01 12:09:19| v1| c1|      1|  c1|             4|
|2018-02-01 14:05:10| v2| c2|      1|  c2|             5|
+-------------------+---+---+-------+----+--------------+
  

You need to partition (group) the window and add the boundary logic to get the expected result, as sketched below.
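
A minimal sketch of that change (not part of the original answer): partition the window by (id, cm) and order by date_time instead of id. In this sample data every (id, cm) group happens to end at the row where bcm == cm, so a plain partitioned running sum already matches the expected output; if rows could occur after that boundary, an extra condition on bcm would be needed.

// Same query as above, but partitioned by (id, cm) and ordered by date_time.
val partitioned = sqlContext.sql("select *, " +
  "SUM(p_count) over ( partition by id, cm order by date_time " +
  "rows between unbounded preceding and current row ) p_sum_count " +
  "from input_Table ")
partitioned.show()

// Equivalent DataFrame API version, using the Window/functions imports above.
val w = Window
  .partitionBy("id", "cm")
  .orderBy("date_time")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
input.withColumn("p_sum_count", sum("p_count").over(w)).show()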

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Logically, a window aggregate function is recomputed for every row of the PARTITION, based on all ROWS between a starting row and an ending row.

The start and end rows may be fixed or relative to the current row, based on the following keywords:

  • UNBOUNDED PRECEDING, all rows before the current row -> fixed
  • UNBOUNDED FOLLOWING, all rows after the current row -> fixed
  • x PRECEDING, x rows before the current row -> relative
  • y FOLLOWING, y rows after the current row -> relative

Possible kinds of calculation include:

  • Both the start and the end row are fixed: the window consists of all rows of the partition, e.g. a group sum, i.e. aggregate plus detail rows
  • One end is fixed, the other is relative to the current row: the number of rows grows or shrinks, e.g. a running (remaining) total
  • Both the start and the end row are relative to the current row: the number of rows in the window is fixed, e.g. a moving average over n rows
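
As an illustration of these three frame shapes (a sketch only, reusing the input DataFrame created above; the group_sum, running_sum and moving_avg column names are made up for the example):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Fixed start and end: the whole partition -> one group sum repeated on every row.
val wholePartition = Window.partitionBy("id").orderBy("date_time")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Fixed start, relative end: the frame grows row by row -> running total.
val runningTotal = Window.partitionBy("id").orderBy("date_time")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Relative start and end: a fixed-size frame -> moving average over 3 rows.
val movingAvg = Window.partitionBy("id").orderBy("date_time")
  .rowsBetween(-2, Window.currentRow)

input
  .withColumn("group_sum", sum("p_count").over(wholePartition))
  .withColumn("running_sum", sum("p_count").over(runningTotal))
  .withColumn("moving_avg", avg("p_count").over(movingAvg))
  .show()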

So SUM(x) OVER (ORDER BY col ROWS UNBOUNDED PRECEDING) results in a cumulative sum or running total:

11 -> 11
 2 -> 11 +  2                = 13
 3 -> 13 +  3 (or 11+2+3)    = 16
44 -> 16 + 44 (or 11+2+3+44) = 60
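
The same arithmetic can be reproduced in Spark SQL; a toy sketch (the nums / col / x names are made up for this example, not from the original answer):

val nums = sqlContext.createDataFrame(Seq(
  (1, 11), (2, 2), (3, 3), (4, 44)
)).toDF("col", "x")
nums.createOrReplaceTempView("nums")

// ROWS UNBOUNDED PRECEDING is shorthand for the BETWEEN form written out here.
sqlContext.sql("select x, " +
  "SUM(x) over ( order by col rows between unbounded preceding and current row ) running_total " +
  "from nums").show()
// running_total: 11, 13, 16, 60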

See also: What is ROWS UNBOUNDED PRECEDING used for in Teradata?