业务环境: 简而言之,我希望将来分解数据集以测试和控制并运行一些广告系列。但是,当我尝试拆分它时,似乎涉及很多复杂性。
数据:
RetailerCode CID Segment
A6005 13SVC15 High
A6005 19VDE1F Low
A6005 1B3BD1F Medium
A6005 1B3HB48 Medium
A6005 1B3HB49 Low
A9006 1B3HB40 High
A9006 1B3HB41 High
A9006 1B3HB43 Low
A9006 1B3HB46 Medium
我有一个像这样的主表,其中包含客户及其零售商,细分市场等的列表(它有30多个列,但我不想在这里显示所有内容)。我想通过添加一列来拆分此数据集,比如说“ test_control”,它将告诉我们特定的行是测试类别还是控件类别。我可以随机拆分,但是我需要遵循以下规则,
- If a customer is tied to two or more retailers, then he should be in control group
- Each retailer will be provided with list of customers to target for the campaigns and the retailer will run the campaign. Here,
o Test-Control split should be done at Retailer level and then at segment level. For example, For each retailer
10% of their High customers to control and remaining 90% of their high customers to test.
10% of their Medium customers to control and remaining 90% of their Medium customers to test
10% of their Low customers to control and remaining 90% of their Low customers to test.
我可以共享我编写的查询,但这会让你们完全困惑
但这是我遵循的高级方法
- Took out the records from the main table where customer occurred more than once and added a “test_control” column with “control” in it and saved it in a separate temp table 1
- Took out the records from the main table where customer occurred only once and saved it in a temp table 2
- On table 2, I tried to split to 10% (control) and 90% (test)
- Used union all to merge table 1 and table 2. But the ouput is totally wrong
预期产量
如您所见,它在细分受众群级别上也分为10%和90%
请有人帮助我。过去几天一直在努力,但是没有运气
预先感谢
我写的查询
--Tagging each row whether it is repated more than twice or not
select * into #Repeat from (
SELECT CID, Count(*) as number,
case when Count(*)>1 then '1'
else '0'
end as repeat
FROM #Target GROUP BY CID
) temp
--Joining the above table to the master table and creating a new table
SELECT * into #T from(
select a.*,b.repeat from #Target a
left join #Repeat b ON a.CID = b.CID
) temp
alter table #T
add t_c Varchar(400)
UPDATE #T
SET t_c = CASE
WHEN repeat = '1' THEN 'control'
ELSE repeat
END
--creating a sub table which has no repetitive customer
SELECT * into #T1
from #T where t_c <> 'control'
--creating a sub table which has repetitive customer
select * from #T
SELECT * into #T2
from #T where t_c = 'control'
--splitting the table(#T1) into test and control
select * into #T3 from
(select *,
(case
when row_number() over (partition by RetailerCode,SEGMENT order by newid()) <= ((1-0.9-((select count(*) from #Target)/(select count(*) from #T2))) * count(*) over (partition by RetailerCode,SEGMENT)) then 'control'
else 'test'
end) as t_c_new
from #T1
where RETAILERCODE IN (SELECT DISTINCT RETAILERCODE FROM #Target
WHERE CID IN(
SELECT CID FROM
(SELECT CID, COUNT(*) AS NoOfOccurrences
FROM #Target GROUP BY CID
HAVING COUNT(*) = 1 ) temp
))
)temp1
--renaming a column
EXEC tempdb.sys.sp_rename N'#T2.t_c', N't_c_new', N'COLUMN';
ALTER TABLE #T3 DROP COLUMN t_c
--Merging the output
select * into #T4 from
(SELECT * from #T3 --1085612
union all
select * from #T2 -- 89622
) temp
--QA check…this is where I found out my logic is wrong
select RetailerCode, t_c_new, Segment
from
#T4 group by RetailerCode, t_c_new, Segment
order by RetailerCode, t_c_new, Segment
答案 0 :(得分:2)
如果要将客户分配到特定组,则应先订购他们:
SELECT RetailerCode, CID, Segment,
CASE WHEN Percent_Rank()
Over (PARTITION BY retailercode, segment -- for each retailer/segment
ORDER BY ControlGroup, newid() -- all customers with multiple retailers are sorted low, i.e. will be in control group (if it's less than 10%)
) <= 0.1
THEN 'control'
ELSE 'test'
END AS GROUP
FROM
(
SELECT t.*,
-- flag customers to be put in control group
CASE WHEN Min(RetailerCode) Over (PARTITION BY CID)
= Max(RetailerCode) Over (PARTITION BY CID)
THEN 1 -- only a single retailer
ELSE 0 -- multiple retailers
END AS ControlGroup
-- if the RetailerCode/CID combination is unique:
-- CASE WHEN Count(*) Over (PARTITION BY CID) = 1
-- THEN 1 -- only a single retailer
-- ELSE 0 -- multiple retailers
-- END AS ControlGroup
FROM tab t
) AS dt;
答案 1 :(得分:0)
我假设与多个零售商有业务的客户不计入对照组的10%。您可以使用窗口功能找到它们。
然后,以下内容将每个零售商和细分市场中的其余客户划分为适当大小的组:
select RetailerCode, CID, Segment,
(case when auto_in_control = 1 then 'control'
when row_number() over (partition by retailercode, segment, auto_in_control order by newid()) <=
0.1 * count(*) over (partition by retailercode, segment, auto_in_control)
then 'control'
else 'test'
end) as group
from (select t.*,
min(RetailerCode) over (partition by cid) as min_rc,
max(RetailerCode) over (partition by cid) as max_rc,
(case when min_rc = max_rc then 0 else 1 end) as auto_in_control
from Table t
) t
Order by RetailerCode;
如果您希望每个零售商控制10%的客户,并优先考虑多个零售商中的客户,则:
select RetailerCode, CID, Segment,
(case when row_number() over (partition by retailercode, segmentorder by auto_in_control desc, newid()) <=
0.1 * count(*) over (partition by retailercode, segment)
then 'control'
else 'test'
end) as group
from (select t.*,
min(RetailerCode) over (partition by cid) as min_rc,
max(RetailerCode) over (partition by cid) as max_rc,
(case when min_rc = max_rc then 0 else 1 end) as auto_in_control
from Table t
) t
Order by RetailerCode;
如果给定细分市场中给定零售商的客户中有10%以上拥有多个零售商,则只有10%的客户会被选中。其余的此类客户将在测试组中。
顺便说一句,这是不良的测试设计。最好分配客户对其零售商进行独立测试/控制。更重要的是,对照组和测试组没有代表性,因此我不想尝试解释此类测试的结果。好吧,你(一次)可以付给我足够的钱;)