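I'm looking for some expert SQL help with a tricky statistical problem.
What I want to do is extract a statistically balanced sample from an unbalanced set of user profiles. Doing this for a single profile attribute at a time (e.g. gender) would be fairly simple. Doing it across several dimensions at once, however, is more complex.
For the sake of argument, let's say I have this table: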
Profile.userID
Profile.Gender
Profile.Age
Profile.Income
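Suppose I want to pull a set of profiles from the mix so that the new sample of users roughly matches all of the following characteristics: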
50% male, 50% female
30% young, 40% middle age, 40% old
40% low income, 40% middle income, 20% high income
Does anyone have any ideas on how to approach this?
Answer 0 (score: 3)
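What you have is a sampling problem. The key to solving it is to split the data into separate groups, one for each combination of the three variables. Then compute the product of the marginal probabilities for each group (the values you give are the marginal probabilities). Finally, normalize across all 18 groups (2 genders x 3 age bands x 3 income bands).
For example, the value for the male-young-low group is 0.5 * 0.3 * 0.4 = 0.06. Repeat this for all 18 groups, then normalize to percentages (i.e. divide each value by the sum of all the values, which here is 1.1 because the age percentages add up to 110%). The results are: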
Gender  Age     Income  Marg  Normalized
Male    Young   Low     0.06  5.5%
Male    Young   Middle  0.06  5.5%
Male    Young   High    0.03  2.7%
Male    Middle  Low     0.08  7.3%
Male    Middle  Middle  0.08  7.3%
Male    Middle  High    0.04  3.6%
Male    Old     Low     0.08  7.3%
Male    Old     Middle  0.08  7.3%
Male    Old     High    0.04  3.6%
Female  Young   Low     0.06  5.5%
Female  Young   Middle  0.06  5.5%
Female  Young   High    0.03  2.7%
Female  Middle  Low     0.08  7.3%
Female  Middle  Middle  0.08  7.3%
Female  Middle  High    0.04  3.6%
Female  Old     Low     0.08  7.3%
Female  Old     Middle  0.08  7.3%
Female  Old     High    0.04  3.6%
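The arithmetic behind the table can be reproduced with a short script. This is a minimal Python sketch, not part of the original answer: the names `gender`, `age`, `income` and `normalized_rates` are made up for illustration, and the marginals are the ones from the question, including the age shares that add up to 110%.

```python
from itertools import product

# Target marginals as given in the question (the age shares sum to 1.1).
gender = {"Male": 0.5, "Female": 0.5}
age = {"Young": 0.3, "Middle": 0.4, "Old": 0.4}
income = {"Low": 0.4, "Middle": 0.4, "High": 0.2}


def normalized_rates(*marginals):
    """Multiply the marginals for every combination, then rescale to sum to 1."""
    raw = {}
    for combo in product(*(m.items() for m in marginals)):
        labels = tuple(label for label, _ in combo)
        value = 1.0
        for _, share in combo:
            value *= share
        raw[labels] = value
    total = sum(raw.values())
    return {labels: value / total for labels, value in raw.items()}


rates = normalized_rates(gender, age, income)
for labels, rate in rates.items():
    print(*labels, f"{rate:.1%}")
```

Because the function divides by the total, it also works unchanged if the age shares are corrected to sum to 100%; in that case the normalization step simply leaves the products as they are.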
These then become the sampling rates for each group. Here is pseudo-SQL code that does the actual sampling:
with SamplingRates as (
      -- one row per group: 2 x 3 x 3 = 18 rows in total
      select 'Male' as gender, 'Young' as age, 'Low' as income, 0.055 as SamplingRate
      union all . . .
     )
select t.*
from (select t.*,
             -- number the rows of each group in random order
             row_number() over (partition by gender, age, income order by <random>) as seqnum,
             count(*) over (partition by gender, age, income) as NumRecs
      from table t
     ) t join
     SamplingRates sr
     on t.gender = sr.gender and t.age = sr.age and t.income = sr.income and
        t.seqnum <= sr.SamplingRate * t.NumRecs
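For readers who want to experiment outside the database, here is a rough Python sketch of a closely related variant. It is not the query above: instead of applying the rate to each group's own record count, it gives each group a quota equal to its normalized share times a desired sample size. The function name `stratified_sample` and the parameters `sample_size`, `key` and `seed` are assumptions introduced for this example.

```python
import random
from collections import defaultdict


def stratified_sample(rows, rates, sample_size, key, seed=None):
    """Draw about sample_size rows so each group gets roughly its target share.

    rows:  list of records; key(row) returns the group label of a row.
    rates: dict mapping group label -> target share (shares sum to 1).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sample = []
    for label, share in rates.items():
        members = groups.get(label, [])
        quota = round(share * sample_size)
        # A group cannot contribute more rows than it has.
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Toy data with one attribute: 900 males and 100 females, target 50/50.
rows = [{"userID": i, "gender": "Male" if i < 900 else "Female"} for i in range(1000)]
picked = stratified_sample(
    rows, {"Male": 0.5, "Female": 0.5}, 100, key=lambda r: r["gender"], seed=1
)
```

With three attributes, `key` would return a tuple such as `(gender, age, income)` and `rates` would hold the 18 normalized shares. If a group has fewer rows than its quota, the sample falls short of the target for that group, so the smallest group limits how large a balanced sample can be.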
Answer 1 (score: 0)
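Here is how I would go about it, assuming: 30% young, 40% middle age, 30% old
Taking the lowest common denominator, your pool size = 5x5x3x4x2x4 = 2400
You have 18 queries that fill the pool into a TEMP TABLE. Repeat all 18 queries to get a bigger pool. Below is how the ideal pool is distributed and what each query would look like. You can also introduce some randomness into each query. There is a post about doing that.
This may not be very elegant, but it should produce a balanced pool.
Your first query in pseudocode would look like: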
SELECT * INTO TEMP TABLE
WHERE male, young, high income and ID NOT IN TEMP TABLE
LIMIT RECORD SET 72
And so on. Hope that helps. Good question.
CREATE TEMP TABLE
    480 high income
        144 young
            72 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
            72 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
        192 middle age
            96 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 96]
            96 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 96]
        144 old
            72 males [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
            72 females [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 72]
    960 middle income
        288 young
            144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
            144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
        384 middle age
            192 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
            192 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
        288 old
            144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
            144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
    960 low income
        288 young
            144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
            144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
        384 middle age
            192 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
            192 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 192]
        288 old
            144 male [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
            144 female [SELECT THIS INTO TEMP TABLE WHERE ID NOT IN TEMP TABLE LIMIT 144]
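The LIMIT values in the breakdown above can be derived rather than worked out by hand. The following Python sketch is only an illustration under this answer's assumptions (pool size 2400, age split 30/40/30); the names `POOL_SIZE` and `quotas` are invented for the example.

```python
from itertools import product

POOL_SIZE = 2400  # pool size from the answer

# Target shares in whole percent (age assumed 30/40/30, as in the answer).
income = {"high": 20, "middle": 40, "low": 40}
age = {"young": 30, "middle age": 40, "old": 30}
gender = {"male": 50, "female": 50}

quotas = {}
for (inc, i), (ag, a), (gen, g) in product(income.items(), age.items(), gender.items()):
    # Integer arithmetic keeps the quotas exact (no floating-point rounding).
    quotas[(inc, ag, gen)] = POOL_SIZE * i * a * g // 100**3

for cell, limit in quotas.items():
    print(cell, limit)
```

Each entry of `quotas` is the LIMIT for one of the 18 queries, and the entries add up to the pool size.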