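I am combining domains and months with their total orders for the respective month. I want to fill in the missing combinations with a value of 0. What is the cheapest aggregation command I can use in PySpark?
I have the following input table: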
domain month year total_orders
google.com 01 2017 20
yahoo.com 02 2017 30
google.com 03 2017 30
yahoo.com 03 2017 40
a.com 04 2017 50
a.com 05 2017 50
a.com 06 2017 50
Expected output:
domain month year total_orders
google.com 01 2017 20
yahoo.com 02 2017 30
google.com 03 2017 30
yahoo.com 03 2017 40
a.com 04 2017 50
a.com 05 2017 50
a.com 06 2017 50
google.com 02 2017 0
google.com 04 2017 0
yahoo.com 04 2017 0
google.com 05 2017 0
yahoo.com 05 2017 0
google.com 06 2017 0
yahoo.com 06 2017 0
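The order of the rows in the expected output does not matter.
Answer 0 (score: 0)
The simplest approach is to generate every month/year combination for each domain and left join the original data onto it: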
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my cross join
     (select distinct domain from input) d left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain;
Note: this assumes that each year/month combination occurs at least once somewhere in the data.
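The cross-join query above can be tried on the sample rows without a Spark cluster. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in engine; it assumes the bare `t` in the join refers to the same `input` table, and the helper name `fill_missing` is made up for this illustration. In PySpark, the same SQL text could presumably be passed to `spark.sql(...)` after registering the DataFrame with `createOrReplaceTempView("input")`.

```python
import sqlite3

# Sample rows from the question: (domain, month, year, total_orders)
ROWS = [
    ("google.com", "01", "2017", 20),
    ("yahoo.com", "02", "2017", 30),
    ("google.com", "03", "2017", 30),
    ("yahoo.com", "03", "2017", 40),
    ("a.com", "04", "2017", 50),
    ("a.com", "05", "2017", 50),
    ("a.com", "06", "2017", 50),
]

# Every (month, year) x every domain, with the original rows left-joined on.
# Assumption: the joined table "t" is the input table itself.
QUERY = """
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my cross join
     (select distinct domain from input) d left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain
"""


def fill_missing(rows):
    """Load rows into an in-memory table named input and run QUERY (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table input (domain text, month text, year text, total_orders integer)"
    )
    conn.executemany("insert into input values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


filled = fill_missing(ROWS)
for row in sorted(filled, key=lambda r: (r[2], r[1])):
    print(row)
```

On the sample data this produces 18 rows (3 domains times 6 months), so it also includes combinations such as yahoo.com/01 that the expected output in the question leaves out.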
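Getting the values within a range is painful because you have split the date across multiple columns. Let me assume the years are all the same, as in your example: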
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my join
     (select domain, min(month) as min_month, max(month) as max_month
      from input
      group by domain
     ) d
     on my.month >= d.min_month and my.month <= d.max_month left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain;
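To see what the range-based variant does on the sample rows, here is a similar sketch, again using Python's built-in `sqlite3` module rather than Spark. It assumes the subquery is grouped by `domain` and that `t` is the `input` table; the helper name `fill_within_range` is hypothetical. The month comparison relies on months being zero-padded strings, as in the question.

```python
import sqlite3

# Sample rows from the question: (domain, month, year, total_orders)
SAMPLE = [
    ("google.com", "01", "2017", 20),
    ("yahoo.com", "02", "2017", 30),
    ("google.com", "03", "2017", 30),
    ("yahoo.com", "03", "2017", 40),
    ("a.com", "04", "2017", 50),
    ("a.com", "05", "2017", 50),
    ("a.com", "06", "2017", 50),
]

# Each domain is paired only with months between its own first and last month.
# Assumptions: "group by domain" in the subquery, and "t" is the input table.
RANGE_QUERY = """
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my join
     (select domain, min(month) as min_month, max(month) as max_month
      from input
      group by domain
     ) d
     on my.month >= d.min_month and my.month <= d.max_month left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain
"""


def fill_within_range(rows):
    """Run RANGE_QUERY against an in-memory copy of the rows (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table input (domain text, month text, year text, total_orders integer)"
    )
    conn.executemany("insert into input values (?, ?, ?, ?)", rows)
    result = conn.execute(RANGE_QUERY).fetchall()
    conn.close()
    return result


ranged = fill_within_range(SAMPLE)
zero_rows = [r for r in ranged if r[3] == 0]
print(len(ranged), zero_rows)
```

Because each domain is limited to its own first and last month, the only gap filled on this sample is google.com/02. Months after a domain's last order are not added, so this fills fewer rows than the expected output in the question.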