Question

业务环境： 简而言之，我希望将来分解数据集以测试和控制并运行一些广告系列。但是，当我尝试拆分它时，似乎涉及很多复杂性。

数据：

RetailerCode    CID        Segment
A6005         13SVC15       High
A6005         19VDE1F       Low
A6005         1B3BD1F       Medium
A6005         1B3HB48       Medium
A6005         1B3HB49       Low
A9006         1B3HB40       High
A9006         1B3HB41       High
A9006         1B3HB43       Low
A9006         1B3HB46       Medium

我有一个像这样的主表，其中包含客户及其零售商，细分市场等的列表（它有30多个列，但我不想在这里显示所有内容）。我想通过添加一列来拆分此数据集，比如说“ test_control”，它将告诉我们特定的行是测试类别还是控件类别。我可以随机拆分，但是我需要遵循以下规则，

-   If a customer is tied to two or more retailers, then he should be in control group
-   Each retailer will be provided with list of customers to target for the campaigns and the retailer will run the campaign. Here, 
o   Test-Control split should be done at Retailer level and then at segment level. For example,  For each retailer
   10% of their High customers to control and remaining 90% of their high customers to test.
   10% of their Medium customers to control and remaining 90% of their Medium customers to test
   10% of their Low customers to control and remaining 90% of their Low customers to test.

我可以共享我编写的查询，但这会让你们完全困惑

但这是我遵循的高级方法

-   Took out the records from the main table where customer occurred more than once and added a “test_control” column with “control” in it and saved it in a separate temp table 1
-   Took out the records from the main table where customer occurred only once and saved it in a temp table 2
-   On table 2, I tried to split to 10% (control) and 90% (test)
-   Used union all to merge table 1 and table 2. But the ouput is totally wrong

预期产量

如您所见，它在细分受众群级别上也分为10％和90％

请有人帮助我。过去几天一直在努力，但是没有运气

预先感谢

我写的查询

--Tagging each row whether it is repated more than twice or not
select * into #Repeat from (
SELECT CID, Count(*) as number,
case when Count(*)>1 then '1'
 else '0'
end as repeat
FROM #Target GROUP BY CID
) temp


--Joining the above table to the master table and creating a new table
SELECT * into #T from(
select a.*,b.repeat from #Target a
left join #Repeat b ON a.CID = b.CID
) temp
alter table #T
add t_c Varchar(400)
UPDATE #T 
SET     t_c =  CASE  
                        WHEN repeat = '1' THEN 'control' 
                        ELSE repeat
                    END
--creating a sub table which has no repetitive customer
SELECT * into #T1
from #T where t_c <> 'control'

--creating a sub table which has repetitive customer
select * from #T
SELECT * into #T2
from #T where t_c = 'control'



--splitting the table(#T1) into test and control 
select * into #T3 from
(select *, 
       (case 
             when row_number() over (partition by RetailerCode,SEGMENT order by newid()) <= ((1-0.9-((select count(*) from #Target)/(select count(*) from #T2))) * count(*) over (partition by RetailerCode,SEGMENT)) then 'control'
             else 'test'
        end) as t_c_new
from #T1
where RETAILERCODE IN (SELECT DISTINCT RETAILERCODE FROM #Target
WHERE CID IN(
SELECT CID FROM
(SELECT  CID, COUNT(*) AS NoOfOccurrences
FROM #Target GROUP BY CID
HAVING COUNT(*) = 1 ) temp
))
)temp1


--renaming a column 
EXEC tempdb.sys.sp_rename N'#T2.t_c', N't_c_new', N'COLUMN';

ALTER TABLE #T3 DROP COLUMN t_c

--Merging the output 
select * into #T4 from
(SELECT * from #T3 --1085612
union all
select * from #T2 -- 89622
) temp

--QA check…this is where I found out my logic is wrong

select RetailerCode, t_c_new, Segment
 from
#T4 group by RetailerCode, t_c_new, Segment
order by RetailerCode, t_c_new, Segment

Answer 1

如果要将客户分配到特定组，则应先订购他们：

SELECT RetailerCode, CID, Segment,
   CASE WHEN Percent_Rank()
              Over (PARTITION BY retailercode, segment -- for each retailer/segment
                    ORDER BY ControlGroup, newid()     -- all customers with multiple retailers are sorted low, i.e. will be in control group (if it's less than 10%)
                   ) <= 0.1 
             THEN 'control'
             ELSE 'test'
   END AS GROUP
FROM
 (
   SELECT t.*,
      -- flag customers to be put in control group
      CASE WHEN Min(RetailerCode) Over (PARTITION BY CID)
              = Max(RetailerCode) Over (PARTITION BY CID)
           THEN 1 -- only a single retailer
           ELSE 0 -- multiple retailers 
      END AS ControlGroup
-- if the RetailerCode/CID combination is unique:
--      CASE WHEN Count(*) Over (PARTITION BY CID) = 1
--           THEN 1 -- only a single retailer
--           ELSE 0 -- multiple retailers 
--      END AS ControlGroup
   FROM tab t
 ) AS dt;

Answer 2

我假设与多个零售商有业务的客户不计入对照组的10％。您可以使用窗口功能找到它们。

然后，以下内容将每个零售商和细分市场中的其余客户划分为适当大小的组：

select RetailerCode, CID, Segment,
       (case when auto_in_control = 1 then 'control'
             when row_number() over (partition by retailercode, segment, auto_in_control order by newid()) <= 
                  0.1 * count(*) over (partition by retailercode, segment, auto_in_control) 
             then 'control'
             else 'test'
        end) as group
from (select t.*,
             min(RetailerCode) over (partition by cid) as min_rc,
             max(RetailerCode) over (partition by cid) as max_rc,
             (case when min_rc = max_rc then 0 else 1 end) as auto_in_control
      from Table t
     ) t
Order by RetailerCode;

如果您希望每个零售商控制10％的客户，并优先考虑多个零售商中的客户，则：

select RetailerCode, CID, Segment,
       (case when row_number() over (partition by retailercode, segmentorder by auto_in_control desc, newid()) <= 
                  0.1 * count(*) over (partition by retailercode, segment) 
             then 'control'
             else 'test'
        end) as group
from (select t.*,
             min(RetailerCode) over (partition by cid) as min_rc,
             max(RetailerCode) over (partition by cid) as max_rc,
             (case when min_rc = max_rc then 0 else 1 end) as auto_in_control
      from Table t
     ) t
Order by RetailerCode;

如果给定细分市场中给定零售商的客户中有10％以上拥有多个零售商，则只有10％的客户会被选中。其余的此类客户将在测试组中。

顺便说一句，这是不良的测试设计。最好分配客户对其零售商进行独立测试/控制。更重要的是，对照组和测试组没有代表性，因此我不想尝试解释此类测试的结果。好吧，你（一次）可以付给我足够的钱；）

测试（90％）和控制（10％）在零售商级别以及SQL的细分级别进行了拆分。怎么做？

2 个答案: