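I am trying to group daily-grain data by week and by month, and then, for each week, work out the totals for the previous 6 weeks and the previous 6 months.
Note: the data volume in my case is large, around 30 million rows.
My current approach is to create several temp tables: one holding the data aggregated at the weekly level, a second holding the data aggregated at the monthly level, and so on. My complete approach is below. Can anyone suggest a way to optimize it?
Update: added input and expected output
Input:
Expected output: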
--Date dim
create temp table date_dim(report_end_wk,start_dt,end_dt,wkno) as(
select to_date('2019-08-03','YYYY-MM-DD'),to_date('2019-07-28','YYYY-MM-DD'),to_date('2019-08-03','YYYY-MM-DD'),31 union
select to_date('2019-07-27','YYYY-MM-DD'),to_date('2019-07-21','YYYY-MM-DD'),to_date('2019-07-27','YYYY-MM-DD'),30 union
select to_date('2019-07-20','YYYY-MM-DD'),to_date('2019-07-14','YYYY-MM-DD'),to_date('2019-07-20','YYYY-MM-DD'),29);
--main table with data at daily grain
create temp table t1(daily_dt,tvtype,sale) as(
select to_date('2019-07-29','YYYY-MM-DD'),'mitv',3000 union
select to_date('2019-08-02','YYYY-MM-DD'),'mitv',3000 union
select to_date('2019-07-30','YYYY-MM-DD'),'samsung',4000 union
select to_date('2019-08-01','YYYY-MM-DD'),'samsung',3000 union
select to_date('2019-07-23','YYYY-MM-DD'),'mitv',2000 union
select to_date('2019-07-26','YYYY-MM-DD'),'mitv',3000 union
select to_date('2019-07-22','YYYY-MM-DD'),'samsung',9000 union
select to_date('2019-07-25','YYYY-MM-DD'),'samsung',3000 );
--getting aggregation as weekly grain
create temp table wk_level_agg as(
select report_end_wk,wkno,to_date(report_end_wk,'YYYY-MM') as monthly_dt,tvtype,sum(sale) as wk_sale from t1 join date_dim on daily_dt between start_dt and end_dt
group by report_end_wk,wkno,to_date(report_end_wk,'YYYY-MM'),tvtype);
--getting aggregation as monthly grain
create temp table month_level_agg as(
select monthly_dt, tvtype, sum(wk_sale) as monthly_sale from wk_level_agg
group by monthly_dt,tvtype);
--getting last 6 week aggregated data at column level. here i have used only last week for example
create temp table wk_hist_agg as(
select report_end_wk,wkno,monthly_dt,tvtype,wk_sale, sum(wk_1_sale) as wk_1_sale from(
select a.*, CASE
WHEN nvl (datediff (week,b.report_end_wk,a.report_end_wk),0) = 1 THEN b.wk_sale
ELSE 0
END AS wk_1_sale from wk_level_agg a
left join wk_level_agg b
on a.tvtype=b.tvtype and (b.report_end_wk BETWEEN TRUNC (dateadd (week,-6,a.report_end_wk))
AND TRUNC (dateadd (week,-1,a.report_end_wk))))
group by report_end_wk,wkno,monthly_dt,tvtype,wk_sale);
--getting last 6 month aggregated data at column level. here i have used only last 1 month for example
create temp table month_hist_agg as(
select monthly_dt,tvtype,monthly_sale, sum(mth_1_sale) as mth_1_sale from(
select a.*, CASE
WHEN nvl (datediff (month,b.monthly_dt,a.monthly_dt),0) = 1 THEN b.monthly_sale
ELSE 0
END AS mth_1_sale from month_level_agg a
left join month_level_agg b
on a.tvtype=b.tvtype and (b.monthly_dt BETWEEN TRUNC (dateadd (month,-6,a.monthly_dt))
AND TRUNC (dateadd (month,-1,a.monthly_dt))))
group by monthly_dt,tvtype,monthly_sale);
--final table data at weekly level and last 6 week and monthly aggregated data at column level
select a.*,b.monthly_sale,b.mth_1_sale from
wk_hist_agg a left join month_hist_agg b on a.monthly_dt=b.monthly_dt and a.tvtype=b.tvtype
order by a.report_end_wk desc;
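Answer 0 (score: 2)
I can see several optimizations to suggest, though I am not sure I will have time to assemble them into a full example.
Use common table expressions (the WITH statement) instead of temp tables. Temp tables may run faster, but Redshift should be fast enough to handle the aggregations (in your case, tens to hundreds of millions of rows).
You end up with a single query in which each temp-table step is declared as a named subquery: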
WITH
my_first_table as (SELECT ... ),
my_second_table as (SELECT ... FROM my_first_table ),
my_third_table as (SELECT ... FROM my_second_table )
SELECT
...
FROM any_of_the_above_declared_tables
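Applied to the question, the template above might look roughly like the following. This is a sketch rather than a tuned solution: it assumes the t1 and date_dim tables from the question, reuses the question's names wk_level_agg and month_level_agg as CTE names, and assumes that a week belongs to the month of its week-ending date.

```sql
-- Sketch only: t1 and date_dim are the tables defined in the question.
WITH
wk_level_agg AS (
    -- Daily rows rolled up to one row per reporting week and TV type.
    SELECT d.report_end_wk,
           d.wkno,
           -- Assumption: a week is assigned to the month of its week-ending date.
           DATE_TRUNC('month', d.report_end_wk)::DATE AS monthly_dt,
           t.tvtype,
           SUM(t.sale) AS wk_sale
    FROM t1 t
    JOIN date_dim d
      ON t.daily_dt BETWEEN d.start_dt AND d.end_dt
    GROUP BY d.report_end_wk, d.wkno, DATE_TRUNC('month', d.report_end_wk)::DATE, t.tvtype
),
month_level_agg AS (
    -- Weekly rows rolled up to one row per month and TV type.
    SELECT monthly_dt, tvtype, SUM(wk_sale) AS monthly_sale
    FROM wk_level_agg
    GROUP BY monthly_dt, tvtype
)
SELECT w.*, m.monthly_sale
FROM wk_level_agg w
LEFT JOIN month_level_agg m
  ON w.monthly_dt = m.monthly_dt
 AND w.tvtype = m.tvtype
ORDER BY w.report_end_wk DESC;
```

The history columns (wk_1_sale, mth_1_sale and so on) could be added as further CTEs in the same statement, so that nothing is materialized between steps.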
Use DATE_TRUNC to produce the different date granularities:
SELECT DATE_TRUNC('month', '2019-08-14'::DATE); -- will return 2019-08-01
SELECT DATE_TRUNC('week', '2019-08-14'::DATE); -- will return 2019-08-12
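One thing to watch: DATE_TRUNC('week', ...) starts weeks on Monday, whereas the date_dim rows in the question run Sunday to Saturday. A possible workaround, sketched here against the question's t1 table, is to shift the date by one day before truncating. The column aliases are illustrative only.

```sql
-- Sketch only: derives month and week buckets directly from the daily date,
-- without joining to date_dim. Column aliases are hypothetical.
SELECT daily_dt,
       DATE_TRUNC('month', daily_dt)::DATE AS monthly_dt,
       -- Monday-based week start, as DATE_TRUNC defines it.
       DATE_TRUNC('week', daily_dt)::DATE AS monday_week_start,
       -- Sunday-based week start: shift forward one day, truncate, shift back.
       DATEADD(day, -1, DATE_TRUNC('week', DATEADD(day, 1, daily_dt)))::DATE AS sunday_week_start,
       -- Saturday week end, comparable to report_end_wk in the question.
       DATEADD(day, 5, DATE_TRUNC('week', DATEADD(day, 1, daily_dt)))::DATE AS report_end_wk
FROM t1;
```

For the sample row dated 2019-08-02 this yields a Sunday week start of 2019-07-28 and a week end of 2019-08-03, which matches the first date_dim row in the question.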
Use TO_CHAR to get the calendar week:
select to_char('2019-08-14'::DATE, 'WW'); -- returns 33
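As a quick check against the question's data, the query below compares the stored wkno with what TO_CHAR returns for each week-ending date. It is only a sketch; the aliases ww_no and iso_wk are made up for illustration.

```sql
-- Sketch only: date_dim is the table from the question.
SELECT report_end_wk,
       wkno,                                          -- value stored in date_dim
       TO_CHAR(report_end_wk, 'WW')::INT AS ww_no,    -- week of year; week 1 starts on January 1
       TO_CHAR(report_end_wk, 'IW')::INT AS iso_wk    -- ISO 8601 week; weeks start on Monday
FROM date_dim
ORDER BY report_end_wk DESC;
```

For the three sample rows, 'WW' returns 31, 30 and 29, the same as wkno. Note that 'WW' counts 7-day blocks from January 1, so it does not follow Sunday-to-Saturday boundaries. Applying it to the week-ending date rather than to each daily date keeps all days of a reporting week under one number.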
Using SUM(CASE WHEN date_condition THEN value END) to get the sums over a given period may be easier, but that depends on how you structure the transformation.
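A minimal sketch of that conditional-aggregation idea, using the question's t1 table, is shown below. The params CTE and its fixed report date are assumptions made for illustration; a real query would likely take the reporting week from date_dim instead.

```sql
-- Sketch only: one pass over t1, one output column per period.
WITH params AS (
    -- Hypothetical anchor: the week-ending date being reported on.
    SELECT TO_DATE('2019-08-03', 'YYYY-MM-DD') AS report_end_wk
)
SELECT t.tvtype,
       -- Current week: the 7 days ending on report_end_wk.
       SUM(CASE WHEN t.daily_dt BETWEEN DATEADD(day, -6, p.report_end_wk)
                                    AND p.report_end_wk
                THEN t.sale ELSE 0 END) AS wk_sale,
       -- Previous week: the 7 days before that.
       SUM(CASE WHEN t.daily_dt BETWEEN DATEADD(day, -13, p.report_end_wk)
                                    AND DATEADD(day, -7, p.report_end_wk)
                THEN t.sale ELSE 0 END) AS wk_1_sale
FROM t1 t
CROSS JOIN params p
GROUP BY t.tvtype;
```

With the sample rows from the question this gives wk_sale 6000 and wk_1_sale 5000 for mitv, and 7000 and 12000 for samsung. Further columns (wk_2_sale through wk_6_sale, or monthly equivalents) follow the same pattern with wider offsets, which avoids the self-join on the aggregated table.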