Question

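I am combining domains and months with their total orders for the corresponding month. I want to fill in the missing combinations with a value of 0. What is the cheapest aggregation command that can be used in PySpark?

I have the following input table: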

domain      month    year   total_orders
google.com  01       2017   20
yahoo.com   02       2017   30
google.com  03       2017   30
yahoo.com   03       2017   40
a.com       04       2017   50
a.com       05       2017   50
a.com       06       2017   50

Expected output:

domain      month    year   total_orders
google.com  01       2017   20
yahoo.com   02       2017   30
google.com  03       2017   30
yahoo.com   03       2017   40
a.com       04       2017   50
a.com       05       2017   50
a.com       06       2017   50
google.com  02       2017   0
google.com  04       2017   0
yahoo.com   04       2017   0
google.com  05       2017   0
yahoo.com   05       2017   0
google.com  06       2017   0
yahoo.com   06       2017   0

The order of the rows in the expected output does not matter.

Answer 1

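The simplest approach is to pair every month/year combination with every domain, then left join the original data back in: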

-- all (month, year) pairs x all domains, then look up the existing totals
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my cross join
     (select distinct domain from input) d left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain;

Note: this assumes that every year/month combination occurs at least once somewhere in the data.

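Restricting the result to a range of months is painful because you have split the date across multiple columns. Let me assume that the years are all the same, as in your example: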

select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my join
     -- first and last month per domain; group by is required for the per-domain min/max
     (select domain, min(month) as min_month, max(month) as max_month
      from input
      group by domain
     ) d
     on my.month >= d.min_month and my.month <= d.max_month left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain
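To see what a range-based query of this shape returns on the sample data, it can be run against an in-memory SQLite database. This is only a sketch: SQLite is used here as a lightweight stand-in for checking the logic, and the query is assumed to behave the same way in Spark SQL. The subquery groups by `domain` so that each domain gets its own first and last month.

```python
import sqlite3

rows = [
    ("google.com", "01", "2017", 20),
    ("yahoo.com", "02", "2017", 30),
    ("google.com", "03", "2017", 30),
    ("yahoo.com", "03", "2017", 40),
    ("a.com", "04", "2017", 50),
    ("a.com", "05", "2017", 50),
    ("a.com", "06", "2017", 50),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table input (domain text, month text, year text, total_orders integer)"
)
conn.executemany("insert into input values (?, ?, ?, ?)", rows)

# Months are zero-padded strings, so string comparison orders them correctly.
query = """
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my join
     (select domain, min(month) as min_month, max(month) as max_month
      from input
      group by domain
     ) d
     on my.month >= d.min_month and my.month <= d.max_month left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain
"""

# Sort by (domain, month) so the printed rows are easy to scan.
result = sorted(conn.execute(query).fetchall(), key=lambda r: (r[2], r[1]))
for row in result:
    print(row)

conn.close()
```

With the sample rows, only gaps between a domain's own first and last month are filled, so the single added row is `google.com` for month `02`. Months after a domain's last order (for example `google.com` in `04` to `06`) are not generated by this variant.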
