Question

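I am trying to roll up daily-grain data to the weekly and monthly levels. Then, for each week, I want the totals for the previous 6 weeks and the previous 6 months.

Note: in my case the data volume is huge, about 30 million rows.

My current approach is to create multiple temp tables: one holding the data aggregated at the weekly level, a second holding the data aggregated at the monthly level, and so on. My complete approach is shown below. Can anyone suggest a way to make it more efficient?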

Update: added input and expected output

Input:

Expected output:

    --Date dim
    create temp table date_dim(report_end_wk,start_dt,end_dt,wkno) as(
    select to_date('2019-08-03','YYYY-MM-DD'),to_date('2019-07-28','YYYY-MM-DD'),to_date('2019-08-03','YYYY-MM-DD'),31 union
    select to_date('2019-07-27','YYYY-MM-DD'),to_date('2019-07-21','YYYY-MM-DD'),to_date('2019-07-27','YYYY-MM-DD'),30 union
    select to_date('2019-07-20','YYYY-MM-DD'),to_date('2019-07-14','YYYY-MM-DD'),to_date('2019-07-20','YYYY-MM-DD'),29);

    --main table with data at daily grain
    create temp table t1(daily_dt,tvtype,sale) as(
    select to_date('2019-07-29','YYYY-MM-DD'),'mitv',3000 union
    select to_date('2019-08-02','YYYY-MM-DD'),'mitv',3000 union
    select to_date('2019-07-30','YYYY-MM-DD'),'samsung',4000 union
    select to_date('2019-08-01','YYYY-MM-DD'),'samsung',3000 union
    select to_date('2019-07-23','YYYY-MM-DD'),'mitv',2000 union
    select to_date('2019-07-26','YYYY-MM-DD'),'mitv',3000 union
    select to_date('2019-07-22','YYYY-MM-DD'),'samsung',9000 union
    select to_date('2019-07-25','YYYY-MM-DD'),'samsung',3000 );

    --getting aggregation as weekly grain
    create temp table wk_level_agg as(
    select report_end_wk,wkno,to_date(report_end_wk,'YYYY-MM') as monthly_dt,tvtype,sum(sale) as wk_sale from t1 join date_dim on daily_dt between start_dt and end_dt
    group by report_end_wk,wkno,to_date(report_end_wk,'YYYY-MM'),tvtype);

    --getting aggregation as monthly grain
    create temp table month_level_agg as(
    select  monthly_dt, tvtype, sum(wk_sale) as monthly_sale from wk_level_agg
    group by monthly_dt,tvtype);

    --getting last 6 week aggregated data at column level. here i have used only last week for example
    create temp table wk_hist_agg as(
    select report_end_wk,wkno,monthly_dt,tvtype,wk_sale, sum(wk_1_sale) as wk_1_sale from(
    select a.*, CASE
                   WHEN nvl (datediff (week,b.report_end_wk,a.report_end_wk),0) = 1 THEN b.wk_sale
                   ELSE 0
                 END AS wk_1_sale from wk_level_agg a
    left join wk_level_agg b
    on a.tvtype=b.tvtype and  (b.report_end_wk BETWEEN TRUNC (dateadd (week,-6,a.report_end_wk))
                  AND TRUNC (dateadd (week,-1,a.report_end_wk))))
                  group by report_end_wk,wkno,monthly_dt,tvtype,wk_sale);

    --getting last 6 month aggregated data at column level. here i have used only last 1 month for example
    create temp table month_hist_agg as(
    select monthly_dt,tvtype,monthly_sale, sum(mth_1_sale) as mth_1_sale from(
    select a.*, CASE
                   WHEN nvl (datediff (month,b.monthly_dt,a.monthly_dt),0) = 1 THEN b.monthly_sale
                   ELSE 0
                 END AS mth_1_sale from month_level_agg a
    left join month_level_agg b
    on a.tvtype=b.tvtype and  (b.monthly_dt BETWEEN TRUNC (dateadd (month,-6,a.monthly_dt))
                  AND TRUNC (dateadd (month,-1,a.monthly_dt))))
                  group by monthly_dt,tvtype,monthly_sale);


    --final table data at weekly level and last 6 week and monthly aggregated data at column level
    select a.*,b.monthly_sale,b.mth_1_sale from 
    wk_hist_agg a left join month_hist_agg b on a.monthly_dt=b.monthly_dt and a.tvtype=b.tvtype
    order by a.report_end_wk desc;

Answer 1

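I see a number of optimizations I can suggest. I'm not sure I'll have time to assemble them into a full example.

Use common table expressions (the so-called WITH statement) instead of temp tables. Temp tables may run faster, but Redshift should be fast enough to handle the aggregation (in your case, 10 or 100 million rows).

You will have a single query, with each temp-table step declared as a named table expression: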
```
WITH
    my_first_table as (SELECT ... ),
    my_second_table as (SELECT ... FROM my_first_table ),
    my_third_table as (SELECT ... FROM my_second_table )
SELECT 
    ...
FROM any_of_the_above_declared_tables
```
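As a rough sketch of how the skeleton above might look for the first two steps of the question, the weekly and monthly temp tables can be written as CTEs in one query. The table and column names (`t1`, `date_dim`, `report_end_wk`, and so on) are taken from the question; using `DATE_TRUNC('month', ...)` for `monthly_dt` in place of the question's `to_date(..., 'YYYY-MM')` is my assumption about what was intended.

```sql
-- Sketch only: assumes the t1 and date_dim tables from the question exist.
WITH
    wk_level_agg AS (
        -- daily rows rolled up to the reporting week they fall into
        SELECT d.report_end_wk,
               d.wkno,
               DATE_TRUNC('month', d.report_end_wk)::DATE AS monthly_dt,
               t.tvtype,
               SUM(t.sale) AS wk_sale
        FROM t1 t
        JOIN date_dim d
          ON t.daily_dt BETWEEN d.start_dt AND d.end_dt
        GROUP BY 1, 2, 3, 4
    ),
    month_level_agg AS (
        -- weekly rows rolled up to the month of the week-ending date
        SELECT monthly_dt,
               tvtype,
               SUM(wk_sale) AS monthly_sale
        FROM wk_level_agg
        GROUP BY 1, 2
    )
SELECT w.report_end_wk,
       w.wkno,
       w.tvtype,
       w.wk_sale,
       m.monthly_sale
FROM wk_level_agg w
JOIN month_level_agg m
  ON m.monthly_dt = w.monthly_dt
 AND m.tvtype = w.tvtype
ORDER BY w.report_end_wk DESC, w.tvtype;
```

The history steps (`wk_hist_agg`, `month_hist_agg`) could be appended as further CTEs in the same way.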

Use DATE_TRUNC to produce the different date grains:

    SELECT DATE_TRUNC('month', '2019-08-14'::DATE);  -- will return 2019-08-01
    SELECT DATE_TRUNC('week', '2019-08-14'::DATE);  -- will return 2019-08-12
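One thing to watch: `DATE_TRUNC('week', ...)` returns a Monday, whereas the weeks in the question's `date_dim` run Sunday to Saturday. The following is a possible way to derive the same Saturday week-ending date directly from `daily_dt`, without the join to `date_dim`. It assumes `daily_dt` is a DATE column; the alias `week_start_monday` is only illustrative.

```sql
-- Sketch only: derives each grain per daily row of the question's t1 table.
SELECT daily_dt,
       DATE_TRUNC('month', daily_dt)::DATE        AS monthly_dt,
       DATE_TRUNC('week', daily_dt)::DATE         AS week_start_monday,
       -- shift by one day so Sunday falls into the following Monday-based week,
       -- then move from that Monday to the Saturday that ends the week
       DATE_TRUNC('week', daily_dt + 1)::DATE + 5 AS report_end_wk
FROM t1;
```

For example, 2019-07-28 (a Sunday) and 2019-08-03 (a Saturday) would both map to `report_end_wk` 2019-08-03, matching the first row of `date_dim`.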

Use TO_CHAR to get the calendar week:

    select to_char('2019-08-14'::DATE, 'WW');  -- returns 33

Using SUM(CASE WHEN date_condition THEN value END) to get the sum over a time period may be easier, but that depends on how you structure your transformations.
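A minimal sketch of that conditional-aggregation idea, run against the sample `t1` table from the question. The week boundaries are hard-coded from the question's `date_dim` rows purely for illustration; in a real query they would more likely come from a date dimension or a parameter.

```sql
-- Sketch only: one pass over t1, one output column per period.
SELECT tvtype,
       SUM(CASE WHEN daily_dt BETWEEN '2019-07-28'::DATE AND '2019-08-03'::DATE
                THEN sale END) AS wk_0_sale,  -- current reporting week
       SUM(CASE WHEN daily_dt BETWEEN '2019-07-21'::DATE AND '2019-07-27'::DATE
                THEN sale END) AS wk_1_sale   -- previous week
FROM t1
GROUP BY tvtype;
-- With the question's sample rows:
--   mitv    -> wk_0_sale = 6000, wk_1_sale = 5000
--   samsung -> wk_0_sale = 7000, wk_1_sale = 12000
```

This avoids the self-join, at the cost of writing one CASE expression per period column.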

How can I optimize this approach in Redshift?
