SQL: Bucketing calculated results

Date: 2017-05-10 20:23:42

Tags: sql aggregate-functions teradata

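I'm pulling data from a table of transactions and want to bucket the results by each account's average transaction size, then show the number of accounts, the number of transactions, the total transaction amount, and the average transaction size as columns. Basically this: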

**raw data**                    
date        acct_nr    trans_am         
1/3/2017    1234       400          
1/20/2017   1234       700          
1/22/2017   1234       1100
1/22/2017   2345       300
1/23/2017   2345       800
1/24/2017   3456       1500
1/25/2017   4567       250
1/25/2017   4567       300
1/26/2017   4567       350

**current results**                 
month   tier            acct_ct trans_ct    trans_am    trans_avg
201701  a. >=250 <500   3       5           1600        320
201701  b. >=500 <1000  2       2           1500        750
201701  c. >=1000 <1500 2       2           2600        1300

**expected results**                    
month   tier            acct_ct trans_ct    trans_am    trans_avg (this column should be the key for bucketing, per account)
201701  a. >=250 <500   1       3           900         300
201701  b. >=500 <1000  2       5           3300        660
201701  c. >=1000 <1500 1       1           1500        1500

This is the script I'm currently using. It gives me the current results rather than the expected results:

select
  cldr.year_month
  ,case
    when tran.tran_am >= 0 and tran.tran_am < 100 then 'a. >=0 <100'
    when tran.tran_am >= 100 and tran.tran_am < 250 then 'b. >=100 <250'
    when tran.tran_am >= 250 and tran.tran_am < 500 then 'c. >=250 <500'
    when tran.tran_am >= 500 and tran.tran_am < 1000 then 'd. >=500 <1000'
    when tran.tran_am >= 1000 and tran.tran_am < 1500 then 'e. >=1000 <1500'
    when tran.tran_am >= 1500 and tran.tran_am < 2000 then 'f. >=1500 <2000'
    when tran.tran_am >= 2000 and tran.tran_am < 2500 then 'g. >=2000 <2500'
    when tran.tran_am >= 2500 and tran.tran_am < 5000 then 'h. >=2500 <5000'
    when tran.tran_am >= 5000 and tran.tran_am < 10000 then 'i. >=5000 <10000'
    when tran.tran_am >= 10000 then 'j. >=10000'
    else 'z. other'
    end as trans_am_tier
  ,count(distinct tran.acct_id) as acct_ct
  ,sum(tran.tran_am) as trans_am
  ,count(tran.tran_id) as trans_ct
  ,(trans_am / trans_ct) as trans_avg

  from reports.tran as tran

  inner join reports.date as cldr on cldr.calendar_date=tran.tran_eff_dt
  inner join reports.acct as acct on tran.acct_id=acct.acct_id

  where tran.ext_tran_cd in ('ACHDD','ACHID','ACHRDD')
  and tran.tran_eff_dt between '2017-01-01' and '2017-04-30'
  and tran.prod_type = '4400'
  and acct.acct_stat <> 4
  and acct.dp_cust_nbr NOT IN (1007,1101)

  group by 1,2
  order by 1,2

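I know this happens because I'm bucketing on tran.tran_am rather than on trans_avg. Is this something a subquery can solve? Basically, calculate trans_avg first and then bucket? I'm not sure how I'd do that.

Basically, the result should be: "for every account number, count the number of transactions and average the transaction amount for those transactions. Then, based on that averaged transaction amount, place that account number with its associated transaction count and average transaction size into one of the defined buckets, and then sum the total number of accounts per bucket". So the results should be grouped by account and transaction tier, and the tier should be determined by trans_avg.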

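By the way, I'm an analyst with read-only access to the DBMS. I can't create temp tables or anything like that.

Edited to add the raw data, current results, and expected results to clarify what I'm trying to achieve.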

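2 answers:

Answer 0 (score: 1)

You've identified the right approach: aggregate the data first, then assign the aggregated records to tiers based on trans_avg rather than tran_am. You can do this with a subquery, like so: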

-- Create sample data.
create table [tran]
(
    tran_id bigint,
    acct_id bigint,
    tran_am bigint,
    tran_eff_dt date
);
insert [tran] values
    (1, 1234, 400, '20170103'),
    (2, 1234, 700, '20170120'),
    (3, 1234, 1100, '20170122');

create table calendar
(
    calendar_date date,
    year_month char(6)
);
insert calendar values
    ('20170103', '201701'),
    ('20170120', '201701'),
    ('20170122', '201701');

-- Aggregate transactions first, then assign to a tier.
select
    TransactionsByMonth.year_month,
    case
        when TransactionsByMonth.trans_avg >= 0 and TransactionsByMonth.trans_avg < 100 then 'a. >=0 <100'
        when TransactionsByMonth.trans_avg >= 100 and TransactionsByMonth.trans_avg < 250 then 'b. >=100 <250'
        when TransactionsByMonth.trans_avg >= 250 and TransactionsByMonth.trans_avg < 500 then 'c. >=250 <500'
        when TransactionsByMonth.trans_avg >= 500 and TransactionsByMonth.trans_avg < 1000 then 'd. >=500 <1000'
        when TransactionsByMonth.trans_avg >= 1000 and TransactionsByMonth.trans_avg < 1500 then 'e. >=1000 <1500'
        when TransactionsByMonth.trans_avg >= 1500 and TransactionsByMonth.trans_avg < 2000 then 'f. >=1500 <2000'
        when TransactionsByMonth.trans_avg >= 2000 and TransactionsByMonth.trans_avg < 2500 then 'g. >=2000 <2500'
        when TransactionsByMonth.trans_avg >= 2500 and TransactionsByMonth.trans_avg < 5000 then 'h. >=2500 <5000'
        when TransactionsByMonth.trans_avg >= 5000 and TransactionsByMonth.trans_avg < 10000 then 'i. >=5000 <10000'
        when TransactionsByMonth.trans_avg >= 10000 then 'j. >=10000'
        else 'z. other'
    end as trans_am_tier,
    TransactionsByMonth.acct_ct,
    TransactionsByMonth.trans_am,
    TransactionsByMonth.trans_ct,
    TransactionsByMonth.trans_avg
from
    (
        select
            calendar.year_month,
            count(distinct [tran].acct_id) as acct_ct,
            sum([tran].tran_am) as trans_am,
            count([tran].tran_id) as trans_ct,
            sum([tran].tran_am) / count([tran].tran_id) as trans_avg
        from
            [tran]
            inner join calendar on [tran].tran_eff_dt = calendar.calendar_date
        group by
            calendar.year_month
    ) TransactionsByMonth;

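Note that, to simplify recreating the data set, I've omitted some of the joins and WHERE clause expressions from your original query. I also changed the definition of the trans_avg column, because my DBMS doesn't let me define one element of the SELECT list in terms of an alias listed earlier in the same list. (I don't have Teradata.)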
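Repeating the expression is one way around the alias restriction. Another is to compute the sums in a derived table and divide in the outer query, where the aliases are ordinary column names. The sketch below assumes a hypothetical three-row `sample_tran` built inline (it is not a table from the question) and is written in portable SQL that has not been run on Teradata. It also multiplies by 1.0, since dividing two integer columns truncates on many systems.

```sql
-- Hypothetical inline sample; sample_tran is not a table from the question.
with sample_tran (tran_id, acct_id, tran_am) as (
    select 1, 1234, 400 union all
    select 2, 1234, 700 union all
    select 3, 1234, 1100
)
select
    totals.acct_id,
    totals.trans_am,
    totals.trans_ct,
    -- The aliases are visible here because they come from the derived table.
    -- Multiplying by 1.0 avoids integer division on integer columns.
    totals.trans_am * 1.0 / totals.trans_ct as trans_avg
from
    (
        select
            acct_id,
            sum(tran_am) as trans_am,
            count(tran_id) as trans_ct
        from sample_tran
        group by acct_id
    ) totals;
```

For account 1234 this yields a sum of 2200 over 3 transactions, so trans_avg comes out at roughly 733.33 instead of the truncated 733.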

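Another option is to use a common table expression, or CTE. There are some things you can do with a CTE that you can't do with a subquery (such as writing a recursive query), but in this case it's just a matter of taste. I prefer CTEs because I find them easier to read, especially when several are needed; multiple nested subqueries get confusing in a hurry. Here's what the CTE approach looks like: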

with TransactionsByMonth as
(
    select
        calendar.year_month,
        count(distinct [tran].acct_id) as acct_ct,
        sum([tran].tran_am) as trans_am,
        count([tran].tran_id) as trans_ct,
        sum([tran].tran_am) / count([tran].tran_id) as trans_avg
    from
        [tran]
        inner join calendar on [tran].tran_eff_dt = calendar.calendar_date
    group by
        calendar.year_month
)
select
    TransactionsByMonth.year_month,
    case
        when TransactionsByMonth.trans_avg >= 0 and TransactionsByMonth.trans_avg < 100 then 'a. >=0 <100'
        when TransactionsByMonth.trans_avg >= 100 and TransactionsByMonth.trans_avg < 250 then 'b. >=100 <250'
        when TransactionsByMonth.trans_avg >= 250 and TransactionsByMonth.trans_avg < 500 then 'c. >=250 <500'
        when TransactionsByMonth.trans_avg >= 500 and TransactionsByMonth.trans_avg < 1000 then 'd. >=500 <1000'
        when TransactionsByMonth.trans_avg >= 1000 and TransactionsByMonth.trans_avg < 1500 then 'e. >=1000 <1500'
        when TransactionsByMonth.trans_avg >= 1500 and TransactionsByMonth.trans_avg < 2000 then 'f. >=1500 <2000'
        when TransactionsByMonth.trans_avg >= 2000 and TransactionsByMonth.trans_avg < 2500 then 'g. >=2000 <2500'
        when TransactionsByMonth.trans_avg >= 2500 and TransactionsByMonth.trans_avg < 5000 then 'h. >=2500 <5000'
        when TransactionsByMonth.trans_avg >= 5000 and TransactionsByMonth.trans_avg < 10000 then 'i. >=5000 <10000'
        when TransactionsByMonth.trans_avg >= 10000 then 'j. >=10000'
        else 'z. other'
    end as trans_am_tier,
    TransactionsByMonth.acct_ct,
    TransactionsByMonth.trans_am,
    TransactionsByMonth.trans_ct,
    TransactionsByMonth.trans_avg
from
    TransactionsByMonth;

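As I mentioned, I don't have Teradata installed, but I believe everything here is standard SQL, so hopefully it will work for you or at least point you in the right direction.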
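Both queries above group by month only, so each month produces a single averaged row. The question's expected results bucket each account by its own average, which needs one more level of grouping. The following is a minimal sketch of that variant, assuming the question's raw data is typed in as an inline `raw_tran` CTE with year_month written out by hand. The names `raw_tran`, `per_account`, `tiered`, and the `acct_*` columns are made up for illustration, and the SQL is portable rather than Teradata-tested.

```sql
-- Rows copied from the question's raw data; year_month is written out by hand.
with raw_tran (year_month, acct_id, tran_am) as (
    select '201701', 1234, 400 union all
    select '201701', 1234, 700 union all
    select '201701', 1234, 1100 union all
    select '201701', 2345, 300 union all
    select '201701', 2345, 800 union all
    select '201701', 3456, 1500 union all
    select '201701', 4567, 250 union all
    select '201701', 4567, 300 union all
    select '201701', 4567, 350
),
-- Step 1: one row per month and account.
per_account as (
    select
        year_month,
        acct_id,
        sum(tran_am) as acct_trans_am,
        count(*) as acct_trans_ct,
        avg(tran_am * 1.0) as acct_trans_avg
    from raw_tran
    group by year_month, acct_id
),
-- Step 2: label each account row with a tier based on its own average.
tiered as (
    select
        year_month,
        case
            when acct_trans_avg >= 0 and acct_trans_avg < 100 then 'a. >=0 <100'
            when acct_trans_avg >= 100 and acct_trans_avg < 250 then 'b. >=100 <250'
            when acct_trans_avg >= 250 and acct_trans_avg < 500 then 'c. >=250 <500'
            when acct_trans_avg >= 500 and acct_trans_avg < 1000 then 'd. >=500 <1000'
            when acct_trans_avg >= 1000 and acct_trans_avg < 1500 then 'e. >=1000 <1500'
            when acct_trans_avg >= 1500 and acct_trans_avg < 2000 then 'f. >=1500 <2000'
            when acct_trans_avg >= 2000 and acct_trans_avg < 2500 then 'g. >=2000 <2500'
            when acct_trans_avg >= 2500 and acct_trans_avg < 5000 then 'h. >=2500 <5000'
            when acct_trans_avg >= 5000 and acct_trans_avg < 10000 then 'i. >=5000 <10000'
            when acct_trans_avg >= 10000 then 'j. >=10000'
            else 'z. other'
        end as trans_am_tier,
        acct_trans_am,
        acct_trans_ct
    from per_account
)
-- Step 3: count accounts and add up their transactions per tier.
select
    year_month,
    trans_am_tier,
    count(*) as acct_ct,
    sum(acct_trans_ct) as trans_ct,
    sum(acct_trans_am) as trans_am
from tiered
group by year_month, trans_am_tier
order by year_month, trans_am_tier;
-- Result rows (year_month, trans_am_tier, acct_ct, trans_ct, trans_am):
--   201701 | c. >=250 <500   | 1 | 3 | 900
--   201701 | d. >=500 <1000  | 2 | 5 | 3300
--   201701 | f. >=1500 <2000 | 1 | 1 | 1500
```

The counts and sums match the question's expected results. The one difference is the label for account 3456: its average of exactly 1500 falls in `>=1500 <2000` under the tier definitions used in the queries, whereas the question's table lists it under `>=1000 <1500`. Against the real tables, `raw_tran` would be replaced by the joins and filters from the original query.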

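Answer 1 (score: 1)

Based on your narrative, you need to calculate the average per account first (using a derived table or CTE) and then count the rows per tier: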

select
/*Then, based on that averaged transaction amount, place that account number with associated transaction count and average transaction size into one of the defined buckets, and then sum the total number of accounts per bucket*/
  dt.year_month -- the cldr alias only exists inside the derived table
  ,case -- no need to repeat the lower limit
    when trans_avg >= 0 and trans_avg < 100 then 'a. >=0 <100'
    when trans_avg < 250 then 'b. >=100 <250'
    when trans_avg < 500 then 'c. >=250 <500'
    when trans_avg < 1000 then 'd. >=500 <1000'
    when trans_avg < 1500 then 'e. >=1000 <1500'
    when trans_avg < 2000 then 'f. >=1500 <2000'
    when trans_avg < 2500 then 'g. >=2000 <2500'
    when trans_avg < 5000 then 'h. >=2500 <5000'
    when trans_avg < 10000 then 'i. >=5000 <10000'
    when trans_avg >= 10000 then 'j. >=10000'
    else 'z. other' -- this can only happen when trans_avg is < 0 or NULL
    end as trans_am_tier
   ,count(*)
   ,Sum(trans_ct)
   ,Sum(trans_am)
from
 (
    select
    /*for every account number, count # of transactions and average the transaction amount for those transactions
    */
       cldr.year_month
      ,acct.acct_id
      ,sum(tran.tran_am) as trans_am
      ,count(tran.tran_id) as trans_ct
      ,(trans_am / trans_ct) as trans_avg -- why not a simple avg(tran.tran_am)??
    from reports.tran as tran

      inner join reports.date as cldr on cldr.calendar_date=tran.tran_eff_dt
      inner join reports.acct as acct on tran.acct_id=acct.acct_id

    where tran.ext_tran_cd in ('ACHDD','ACHID','ACHRDD')
      and tran.tran_eff_dt between '2017-01-01' and '2017-04-30'
      and tran.prod_type = '4400'
      and acct.acct_stat <> 4
      and acct.dp_cust_nbr NOT IN (1007,1101)

    group by 1,2
 ) as dt
group by 1,2
order by 1,2
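
The outer query above returns the account count and the two sums per tier, but not a tier-level trans_avg. There are two plausible definitions, and they give different numbers. The sketch below compares them for the `>=500 <1000` tier, assuming a hypothetical inline `dt` that holds the per-account rows for accounts 1234 and 2345 from the question's raw data. The output column names are invented for illustration, and the SQL is portable rather than Teradata-tested.

```sql
-- Hypothetical per-account rows for the '>=500 <1000' tier,
-- derived from accounts 1234 and 2345 in the question's raw data.
with dt (acct_id, trans_am, trans_ct) as (
    select 1234, 2200, 3 union all
    select 2345, 1100, 2
)
select
    count(*) as acct_ct,
    sum(trans_ct) as tier_trans_ct,
    sum(trans_am) as tier_trans_am,
    -- Average per transaction in the tier: 3300 / 5 = 660.
    sum(trans_am) * 1.0 / sum(trans_ct) as avg_per_transaction,
    -- Average of the per-account averages: (733.33 + 550) / 2, about 641.67.
    avg(trans_am * 1.0 / trans_ct) as avg_of_account_avgs
from dt;
```

The question's expected results show 660 for this tier, which corresponds to dividing the summed amount by the summed transaction count rather than averaging the account averages.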