Spark SQL window function range bounds with a condition

Date: 2020-01-16 12:32:45

Tags: sql pyspark-sql window-functions

My data looks like this:

+--------+------+---+
|Sequence|  type| sg|
+--------+------+---+
|       1|  Pump|  3|
|       2|  Pump|  2|
|       3|Inject|  4|
|       4|  Pump|  5|
|       5|  Pump|  3|
|       6|  Pump|  6|
|       7|Inject|  7|
|       8|Inject|  8|
|       9|  Pump|  9|
+--------+------+---+

I want to add a new column whose value depends on the previous row's type value.

If the previous row's type is Pump, set the new column to that previous row's sg value.

If it is Inject, take the sum of the sg values of all previous rows, going back until a row with type Pump is found (that Pump row's sg value is included in the sum).

Ex: for Sequence = 2, the previous row's type is Pump, so the new column's value should be that row's sg value: 3.

For Sequence = 9, the previous row's type is Inject, so the new column's value is the sum of the sg values of the previous three rows, because the row with Sequence = 6 is the nearest earlier row with type = Pump. The new column's value will be 6 + 7 + 8 = 21.

The final output should be:

+--------+------+---+---------+
|Sequence|  type| sg|new_value|
+--------+------+---+---------+
|       1|  Pump|  3|     null|
|       2|  Pump|  2|        3|
|       3|Inject|  4|        2|
|       4|  Pump|  5|        6|
|       5|  Pump|  3|        5|
|       6|  Pump|  6|        3|
|       7|Inject|  7|        6|
|       8|Inject|  8|       13|
|       9|  Pump|  9|       21|
+--------+------+---+---------+
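
For anyone who wants to reproduce this, here is a minimal PySpark sketch that builds the sample data. The session setup and the name df are assumptions, not part of the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pump-inject-demo").getOrCreate()

# Sample data from the question; `df` is an assumed name.
df = spark.createDataFrame(
    [(1, "Pump", 3), (2, "Pump", 2), (3, "Inject", 4),
     (4, "Pump", 5), (5, "Pump", 3), (6, "Pump", 6),
     (7, "Inject", 7), (8, "Inject", 8), (9, "Pump", 9)],
    ["Sequence", "type", "sg"],
)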

1 Answer:

Answer 0 (score: 1)

Given your rules, this is just a bunch of window functions. The trick is to sum each 'Pump' value together with the 'Inject' values that follow it, as one group. A cumulative count of 'Pump' rows identifies those groups.
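
On the sample data, that cumulative count of 'Pump' rows works out to the following groups (derived by hand from the data above; pump_grp is the name used in the query below):

+--------+------+--------+
|Sequence|  type|pump_grp|
+--------+------+--------+
|       1|  Pump|       1|
|       2|  Pump|       2|
|       3|Inject|       2|
|       4|  Pump|       3|
|       5|  Pump|       4|
|       6|  Pump|       5|
|       7|Inject|       5|
|       8|Inject|       5|
|       9|  Pump|       6|
+--------+------+--------+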

The query is then:

select t.*,
       (case when prev_type = 'Pump' then prev_sg
             else lag(pump_sg) over (order by Sequence)
        end) as your_value
from (select t.*,
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   lag(sg) over (order by Sequence) as prev_sg,
                   lag(type) over (order by Sequence) as prev_type,
                   sum(case when type = 'Pump' then 1 else 0 end) over (order by Sequence) as pump_grp
            from t
           ) t
     ) t;
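
Since the question is tagged pyspark-sql, one way to run this is to register the DataFrame as a temp view named t and pass the query to spark.sql. A sketch, assuming the spark and df names from the setup snippet above:

# Register the sample DataFrame under the name the query expects.
df.createOrReplaceTempView("t")

your_value_query = """
select t.*,
       (case when prev_type = 'Pump' then prev_sg
             else lag(pump_sg) over (order by Sequence)
        end) as your_value
from (select t.*,
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   lag(sg) over (order by Sequence) as prev_sg,
                   lag(type) over (order by Sequence) as prev_type,
                   sum(case when type = 'Pump' then 1 else 0 end)
                       over (order by Sequence) as pump_grp
            from t
           ) t
     ) t
"""

spark.sql(your_value_query).orderBy("Sequence").show()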

I think your rules are more complicated than they need to be; you don't need the special case for a previous row of 'Pump'. So:

select t.*,
       lag(pump_sg) over (order by Sequence) as your_value
from (select t.*,
             sum(sg) over (partition by pump_grp order by Sequence) as pump_sg
      from (select t.*,
                   sum(case when type = 'Pump' then 1 else 0 end) over (order by Sequence) as pump_grp
            from t
           ) t
     ) t;
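
The same simplified logic in the PySpark DataFrame API, as a sketch (the window and column names are illustrative, not from the original answer):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Global ordering window. With no partitionBy, Spark moves all rows to
# one partition; fine for a small example like this.
seq_w = Window.orderBy("Sequence")

result = (
    df
    # Cumulative count of 'Pump' rows: each Pump starts a new group
    # that also covers the Inject rows following it.
    .withColumn(
        "pump_grp",
        F.sum(F.when(F.col("type") == "Pump", 1).otherwise(0)).over(seq_w),
    )
    # Running sum of sg within each group.
    .withColumn(
        "pump_sg",
        F.sum("sg").over(Window.partitionBy("pump_grp").orderBy("Sequence")),
    )
    # The previous row's running group sum is the requested value.
    .withColumn("your_value", F.lag("pump_sg").over(seq_w))
    .drop("pump_grp", "pump_sg")
)

result.orderBy("Sequence").show()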