SQL: Running total over identical transactions without using ROWS UNBOUNDED PRECEDING

Date: 2017-08-17 18:15:19

Tags: sql apache-spark-sql pyspark-sql

I am trying to calculate the running total of the "taxi fare" a driver earns on a given day. I originally tested this on Netezza and am now trying to code it in Spark SQL.

But if two rows with the same (driver, day) key have the same 'fare' value, the running_total column always shows the final sum! When all the fares differ, the calculation works perfectly. Is there a way to achieve this (in ANSI SQL or with Spark DataFrames) without using rowsBetween(start, end)?

Sample data:

    driver_id       date_id           fare
    10001           2017-07-27        500
    10001           2017-07-27        500
    10001           2017-07-30        500
    10001           2017-07-30        1500

The SQL query I started with to compute the running total:

    select driver_id, date_id, fare,
           sum(fare) over (partition by date_id, driver_id
                           order by date_id, fare) as run_tot_fare
    from trip_info
    order by 2
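The behavior can be reproduced outside Spark. As a minimal sketch, the query below uses SQLite (3.25+) as a stand-in, since its default window frame is the same as Spark SQL's and ANSI SQL's: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats tied fares as peers.

```python
import sqlite3

# Default window frame with an ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING
# AND CURRENT ROW: tied fares are "peers", so every tied row gets the sum
# through the END of its peer group rather than through itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table trip_info (driver_id int, date_id text, fare int);
    insert into trip_info values
        (10001, '2017-07-27', 500),
        (10001, '2017-07-27', 500),
        (10001, '2017-07-30', 500),
        (10001, '2017-07-30', 1500);
""")
rows = conn.execute("""
    select driver_id, date_id, fare,
           sum(fare) over (partition by date_id, driver_id
                           order by date_id, fare) as run_tot_fare
    from trip_info
    order by 2, 3
""").fetchall()
for r in rows:
    print(r)
# Both 2017-07-27 rows report 1000: the second 500 is a peer of the first,
# so the frame extends through both tied rows instead of stopping at one.
```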

Result:

  driver_id       date_id           fare        run_tot_fare
  10001           2017-07-27        500         1000  --**Showing final total; expecting 500**
  10001           2017-07-27        500         1000
  10001           2017-07-30        500         500   --**No problem here**
  10001           2017-07-30        1500        2000

If anyone can tell me what I am doing wrong, and whether this can be achieved without ROWS UNBOUNDED PRECEDING / rowsBetween(b, e), I would greatly appreciate it. Thanks in advance.

1 answer:

Answer 0 (score: 1)

The traditional solution in SQL is to use rows rather than range (with an ORDER BY, the default frame is range, which treats tied fares as peers and is exactly what produces the behavior above):

    select driver_id, date_id, fare,
           sum(fare) over (partition by date_id, driver_id
                           order by date_id, fare
                           rows between unbounded preceding and current row
                          ) as run_tot_fare
    from trip_info
    order by 2;
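Again using SQLite as a stand-in for the Spark/Netezza SQL, a quick check that the rows frame gives the row-by-row totals the question expects:

```python
import sqlite3

# With an explicit ROWS frame the window stops at the current physical row,
# so tied fares accumulate one at a time (500, then 1000).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table trip_info (driver_id int, date_id text, fare int);
    insert into trip_info values
        (10001, '2017-07-27', 500),
        (10001, '2017-07-27', 500),
        (10001, '2017-07-30', 500),
        (10001, '2017-07-30', 1500);
""")
rows = conn.execute("""
    select driver_id, date_id, fare,
           sum(fare) over (partition by date_id, driver_id
                           order by date_id, fare
                           rows between unbounded preceding and current row
                          ) as run_tot_fare
    from trip_info
    order by 2, 3, 4
""").fetchall()
for r in rows:
    print(r)
# (10001, '2017-07-27', 500, 500)
# (10001, '2017-07-27', 500, 1000)
# (10001, '2017-07-30', 500, 500)
# (10001, '2017-07-30', 1500, 2000)
```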

Absent that, use two levels of window functions, or an aggregation and a join:

    select driver_id, date_id, fare,
           max(run_tot_fare_temp) over (partition by date_id, driver_id) as run_tot_fare
    from (select driver_id, date_id, fare,
                 sum(fare) over (partition by date_id, driver_id
                                 order by date_id, fare
                                ) as run_tot_fare_temp
          from trip_info ti
         ) ti
    order by 2;

(The max() assumes that fare is never negative.)
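If a ROWS frame really must be avoided, as the question asks, one further sketch (not from the answer above) is to break ties with row_number() in a subquery; the synthetic `seq` column makes the ORDER BY key unique, so every peer group is a single row and the default RANGE frame behaves exactly like ROWS. Verified here with SQLite as a stand-in:

```python
import sqlite3

# Tie-breaker approach: row_number() makes (fare, seq) unique within each
# partition, so the default RANGE frame has single-row peer groups and
# yields a true per-row running total without any rows/rowsBetween clause.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table trip_info (driver_id int, date_id text, fare int);
    insert into trip_info values
        (10001, '2017-07-27', 500),
        (10001, '2017-07-27', 500),
        (10001, '2017-07-30', 500),
        (10001, '2017-07-30', 1500);
""")
rows = conn.execute("""
    select driver_id, date_id, fare,
           sum(fare) over (partition by date_id, driver_id
                           order by date_id, fare, seq) as run_tot_fare
    from (select ti.*,
                 row_number() over (partition by date_id, driver_id
                                    order by date_id, fare) as seq
          from trip_info ti) ti
    order by 2, 3, 4
""").fetchall()
for r in rows:
    print(r)
# (10001, '2017-07-27', 500, 500)
# (10001, '2017-07-27', 500, 1000)
# (10001, '2017-07-30', 500, 500)
# (10001, '2017-07-30', 1500, 2000)
```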