How do I find a similarly distributed control group for a test group in SQL?

Time: 2018-05-22 08:01:49

Tags: sql statistics teradata

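The task is quite ambitious. Each month I have a test group of a bit more than 50k users, and I want to match a control group of the same size from the full user base, which is about 500k users. To get a similar distribution I have a few categorical and numerical features. The categorical features are simply inner joined. The numerical features I plan to round, and that is where the biggest problem lies.

Here is my code: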

with pl_subs as( -- in the cte I 
    select   al.*
  ,ROW_NUMBER() OVER(PARTITION BY 
      al.device_type
     ,al.report_mnth
     ,round(al.days_to_LAST_FLASH_DTTM, -1)
     ,round(al.LT_month, -1)
     ,round(al.REVC, -1)
     ,round(al.usg_in, -2)
     ,round(al.usg_AC, -1)
     ORDER BY null) AS RN
    from ai_pl_SUBS test_gr
    inner join  ai_SUBS_MONTH_CLR al 
    on al.cust_id = test_gr.cust_id
    and al.report_mnth = test_gr.REGISTERED_mnth
    where al.report_mnth  = '2017-11' and test_gr.REGISTERED_mnth = '2017-11'
)
sel count(1) -- just to count
from (
sel al.cust_id, pl_subs.rn rn_pl
 ,ROW_NUMBER() OVER(PARTITION BY 
  pl_subs.device_type
 ,pl_subs.report_mnth
 ,pl_subs.MCID
 ,round(pl_subs.days_to_LF, -1)
 ,round(pl_subs.LT_month, -1)
 ,round(pl_subs.REVC, -1)
 ,round(pl_subs.usg_in, -2)
 ,round(pl_subs.usg_AC, -1)
 ORDER BY null) AS RN
from pl_subs
inner join ai_SUBS_MONTH_CLR al on 

-- 2 categorical features
pl_subs.device_type =  al.device_type
and pl_subs.report_mnth = al.report_mnth

-- 5 numerical features
and round(pl_subs.days_to_LF, -1) = Round(al.days_to_LF, -1)
and round(pl_subs.LT_month, -1) = Round(al.LT_month, -1)
and round(pl_subs.REVC, -1) = Round(al.REVC, -1)
and round(pl_subs.usg_in, -2) = Round(al.usg_in, -2) 
and round(pl_subs.usg_AC, -1) = Round(al.usg_AC, -1) 
-- the control group must not contain any cust_id from the test group
where al.cust_id not in (select cust_id from ai_pl_SUBS)
    and al.report_mnth = '2017-11'
    ) _out where rn <=  rn_pl 
-- the 7 features together define a stratum, so from each stratum I need as many customers as the test group has in the matching stratum

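People in the test group have higher numerical values. In the code above I round to tens so that the intermediate spool does not get too large, but because of that I get only 36k users instead of the expected 50k. If I round to 2, the query fails with a spool error.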
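One way to see where the missing users go is to compare, per stratum, how many test users there are against how many non-test candidates exist. The query below is only a sketch: to stay short it defines a stratum by device_type, report_mnth and a single rounded feature (LT_month) rather than all seven features, and the aliases lt_bucket, n_test and n_pool are made up for the illustration.

```sql
-- Sketch: list strata where the candidate pool is smaller than the test group.
-- Assumption: strata are reduced to three columns for brevity.
with test_cnt as (
    select al.device_type
          ,al.report_mnth
          ,round(al.LT_month, -1) as lt_bucket
          ,count(*) as n_test            -- test users in this stratum
    from ai_pl_SUBS test_gr
    inner join ai_SUBS_MONTH_CLR al
        on  al.cust_id = test_gr.cust_id
        and al.report_mnth = test_gr.REGISTERED_mnth
    where al.report_mnth = '2017-11'
    group by 1, 2, 3
),
pool_cnt as (
    select al.device_type
          ,al.report_mnth
          ,round(al.LT_month, -1) as lt_bucket
          ,count(*) as n_pool            -- non-test candidates in this stratum
    from ai_SUBS_MONTH_CLR al
    where al.report_mnth = '2017-11'
      and al.cust_id not in (select cust_id from ai_pl_SUBS)
    group by 1, 2, 3
)
select t.device_type
      ,t.report_mnth
      ,t.lt_bucket
      ,t.n_test
      ,coalesce(p.n_pool, 0) as n_pool
from test_cnt t
left join pool_cnt p
    on  p.device_type = t.device_type
    and p.report_mnth = t.report_mnth
    and p.lt_bucket   = t.lt_bucket
where coalesce(p.n_pool, 0) < t.n_test
order by t.n_test - coalesce(p.n_pool, 0) desc;
```

Every row returned is a stratum that cannot supply enough control users on its own, which would explain a total below the size of the test group.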

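Similarly distributed means having equal means of the numerical values.

Do I have any mistakes in the code? How can I modify the code so that a customer can be included in a stratum more than once?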

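1 answer:

Answer 0: (score: 1)

My code above has a few problems:

1) round(pl_subs.LT_month, -1) = Round(al.LT_month, -1): using round on widely distributed values makes it hard to find a suitable control customer for each test customer. So just use case: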

case when LT_month <= 4 then '0'
     when LT_month <= 8 then '1'
     when LT_month <= 12 then '2'
     when LT_month <= 17 then '3'
     when LT_month <= 24 then '4'
     when LT_month <= 36 then '5'
     when LT_month <= 56 then '6'
     when LT_month <= 83 then '7'
     when LT_month <= 96 then '8'
     -- values above 96 fall through to NULL unless an ELSE branch is added
end

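Precomputing the buckets and using an index will make the query run very fast. But don't overdo it.

2) The CTE should contain only the strata plus the number of people each stratum should contain: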
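As a rough sketch of what such a precomputation could look like in Teradata, the bucket can be materialized once as a column. The table name ai_SUBS_MONTH_SEG, the ELSE branch and the choice of primary index are assumptions made for the illustration; only the LT_month limits come from the CASE expression above.

```sql
-- Sketch: materialize the bucket once instead of recomputing it in every join.
-- ai_SUBS_MONTH_SEG is a hypothetical table name.
create table ai_SUBS_MONTH_SEG as (
    select
         al.cust_id
        ,al.device_type
        ,al.report_mnth
        ,case when al.LT_month <= 4  then '0'
              when al.LT_month <= 8  then '1'
              when al.LT_month <= 12 then '2'
              when al.LT_month <= 17 then '3'
              when al.LT_month <= 24 then '4'
              when al.LT_month <= 36 then '5'
              when al.LT_month <= 56 then '6'
              when al.LT_month <= 83 then '7'
              when al.LT_month <= 96 then '8'
              else '9'                 -- assumed catch-all for larger values
         end as segment
    from ai_SUBS_MONTH_CLR al
    where al.report_mnth = '2017-11'
) with data
primary index (cust_id);               -- assumed: spreads rows evenly across AMPs
```

If several numerical features are bucketed, their codes could be concatenated into the same segment column so that a stratum is matched with a single equality comparison. Fewer, wider buckets keep every stratum populated; too many narrow ones bring back the shortage of control candidates.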

 with pl_subs as( -- in the cte I 
    select
          al.device_type
         ,al.report_mnth
         -- the rounds are replaced by the precomputed bucket column
         ,al.segment
         ,count(1) as max_rn -- number of test-group customers in this stratum
    from ai_pl_SUBS test_gr
    inner join  ai_SUBS_MONTH_CLR al 
    on al.cust_id = test_gr.cust_id
    and al.report_mnth = test_gr.REGISTERED_mnth
    where al.report_mnth  = '2017-11' and test_gr.REGISTERED_mnth = '2017-11'
    -- group by every stratum column, not only the first one
    group by 1, 2, 3
)

sel subs_id, report_mnth from (
sel al.subs_id, al.report_mnth, pl_subs.max_rn max_rn
 -- number the candidates inside each stratum
 ,ROW_NUMBER() OVER(PARTITION BY 
  pl_subs.device_type
 ,pl_subs.report_mnth
 ,pl_subs.segment
 ORDER BY null) AS RN
from pl_subs
inner join UAT_DM.ai_SUBS_MONTH_CLR al on 
pl_subs.device_type =  al.device_type
and pl_subs.report_mnth = al.report_mnth
and pl_subs.segment = al.segment
-- the control group must not contain anyone from the test group
where al.subs_id not in (select subs_id from UAT_DM.ai_pl_SUBS)
    and al.report_mnth = '2017-11'

-- keep as many candidates per stratum as the test group has there
) _out where rn <=  max_rn;
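Since the question defines "similarly distributed" as having equal means of the numerical values, the result can be checked by comparing averages side by side. This is a minimal sketch that assumes the selected control users were saved to a hypothetical table ai_control_gr with a subs_id column, and that subs_id is the key in ai_pl_SUBS as in the final query above.

```sql
-- Sketch: compare group sizes and feature means for test vs. control.
-- ai_control_gr is a hypothetical table holding the selected control users.
select cast('test' as varchar(7)) as grp   -- cast so 'control' is not truncated
      ,count(*)         as n
      ,avg(al.LT_month) as avg_LT_month
      ,avg(al.REVC)     as avg_REVC
      ,avg(al.usg_in)   as avg_usg_in
      ,avg(al.usg_AC)   as avg_usg_AC
from ai_SUBS_MONTH_CLR al
where al.report_mnth = '2017-11'
  and al.subs_id in (select subs_id from ai_pl_SUBS)
union all
select cast('control' as varchar(7)) as grp
      ,count(*)
      ,avg(al.LT_month)
      ,avg(al.REVC)
      ,avg(al.usg_in)
      ,avg(al.usg_AC)
from ai_SUBS_MONTH_CLR al
where al.report_mnth = '2017-11'
  and al.subs_id in (select subs_id from ai_control_gr);
```

If the two rows differ noticeably on a feature, narrowing the buckets for that feature is one option, at the cost of leaving some strata with too few candidates.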