I am trying to calculate a running total of the "fare" a driver earns on a given day. I originally tested this on Netezza and am now trying to code it in Spark SQL.
However, if two rows with the shape ((driver, day) -> fare) have the same 'fare' value, the running_total column always shows the final total! If all the fares are different, the running total is calculated perfectly. Is there a way to achieve this (in ANSI SQL or with Spark DataFrames) without using rowsBetween(start, end)?
Sample data:
driver_id   date_id      fare
10001       2017-07-27    500
10001       2017-07-27    500
10001       2017-07-30    500
10001       2017-07-30   1500
The SQL query I started with to compute the running total:
select driver_id, date_id, fare,
       sum(fare) over (partition by date_id, driver_id
                       order by date_id, fare) as run_tot_fare
from trip_info
order by 2
Result:
driver_id   date_id      fare   run_tot_fare
10001       2017-07-27    500   1000    -- showing the final total, expecting 500
10001       2017-07-27    500   1000
10001       2017-07-30    500    500    -- no problem here
10001       2017-07-30   1500   2000
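In DataFrame terms, the same window (with no explicit frame clause) would look roughly like the sketch below; the SparkSession setup and the registration of trip_info are assumed, and the column names are taken from the sample data:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
trip_info = spark.table("trip_info")  # assumes trip_info is registered as a table/view

# Same window as the SQL query above: one partition per (date_id, driver_id),
# ordered by (date_id, fare), with no explicit frame clause.
w = Window.partitionBy("date_id", "driver_id").orderBy("date_id", "fare")

trip_info.withColumn("run_tot_fare", F.sum("fare").over(w)).orderBy("date_id").show()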
If anyone can tell me what I am doing wrong, and whether this can be achieved without using ROWS UNBOUNDED PRECEDING / rowsBetween(b, e), I would really appreciate it. Thanks in advance.
Answer 0 (score: 1)
The traditional solution in SQL is to use range rather than rows:
select driver_id, date_id, fare,
       sum(fare) over (partition by date_id, driver_id
                       order by date_id, fare
                       range between unbounded preceding and current row
                      ) as run_tot_fare
from trip_info
order by 2;
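Since the question also asks about Spark DataFrames, the same explicit range frame can be expressed with the Window API; a minimal sketch, reusing the assumed trip_info table from the earlier sketch:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
trip_info = spark.table("trip_info")  # assumed registered table/view

# Explicit RANGE frame mirroring the SQL above.
w = (Window.partitionBy("date_id", "driver_id")
           .orderBy("date_id", "fare")
           .rangeBetween(Window.unboundedPreceding, Window.currentRow))

trip_info.withColumn("run_tot_fare", F.sum("fare").over(w)).orderBy("date_id").show()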
Absent that, use two levels of window functions, or an aggregation and a join:
select driver_id, date_id, fare,
       max(run_tot_fare_temp) over (partition by date_id, driver_id) as run_tot_fare
from (select driver_id, date_id, fare,
             sum(fare) over (partition by date_id, driver_id
                             order by date_id, fare
                            ) as run_tot_fare_temp
      from trip_info ti
     ) ti
order by 2;
(The max() assumes that fares are never negative.)
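The nested query above maps to the DataFrame API in the same way; a sketch under the same assumptions (trip_info registered, columns as in the sample data):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
trip_info = spark.table("trip_info")  # assumed registered table/view

# Inner running sum, then the max of that sum within each (date_id, driver_id)
# partition, mirroring the nested SQL above.
w_inner = Window.partitionBy("date_id", "driver_id").orderBy("date_id", "fare")
w_outer = Window.partitionBy("date_id", "driver_id")

result = (trip_info
          .withColumn("run_tot_fare_temp", F.sum("fare").over(w_inner))
          .withColumn("run_tot_fare", F.max("run_tot_fare_temp").over(w_outer))
          .drop("run_tot_fare_temp")
          .orderBy("date_id"))
result.show()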