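I need to balance a dataset in SQL Server for a binary logistic regression project. The data is imbalanced, at roughly 10:90. How would you suggest balancing the data in SQL Server?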
Answer 0 (score: 0)
Here is one approach:
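    select t.*
    from (select t.*,
                 -- random ordering of the rows within each target value
                 row_number() over (partition by target order by newid()) as seqnum,
                 -- total row counts for each class, repeated on every row
                 sum(case when target = 0 then 1 else 0 end) over () as num_0,
                 sum(case when target = 1 then 1 else 0 end) over () as num_1
          from t
         ) t
    -- keep at most as many rows per class as the smaller class has
    where (num_0 <= num_1 and seqnum <= num_0) or
          (num_1 < num_0 and seqnum <= num_1);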
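This randomizes the order of the rows within each value of target. It then keeps all rows for the rarer target value and a random sample of the same size for the more common one. Note that the outer select t.* also returns the helper columns seqnum, num_0, and num_1, so list the columns explicitly if you do not want them in the result.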
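To check the behavior on a small case, here is a minimal sketch that builds a throwaway table with a 10:90 split and runs the same query against it. The temp tables #t and #balanced, the id column, and the use of sys.all_objects as a row source are all assumptions made for illustration; they are not part of the original question.

```sql
-- Hypothetical demo table: 1,000 rows, 10% of them with target = 1.
create table #t (id int primary key, target int not null);

insert into #t (id, target)
select n, case when n % 10 = 0 then 1 else 0 end
from (select top (1000)
             row_number() over (order by (select null)) as n
      from sys.all_objects a
      cross join sys.all_objects b   -- cross join only to guarantee enough rows
     ) nums;

-- Same downsampling logic as above, keeping only the original columns.
select s.id, s.target
into #balanced
from (select t.*,
             row_number() over (partition by target order by newid()) as seqnum,
             sum(case when target = 0 then 1 else 0 end) over () as num_0,
             sum(case when target = 1 then 1 else 0 end) over () as num_1
      from #t t
     ) s
where (num_0 <= num_1 and seqnum <= num_0) or
      (num_1 < num_0 and seqnum <= num_1);

-- Class counts after balancing: both classes should show n = 100.
select target, count(*) as n
from #balanced
group by target;

drop table #balanced;
drop table #t;
```

Because the ordering comes from newid(), each run picks a different set of target = 0 rows. If you need a reproducible sample for the regression, store the result in a permanent table rather than re-running the query.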