Random sampling from a large dataset

Time: 2012-11-29 14:04:11

Tags: sql sql-server sql-server-2008 tsql

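I have a large database from which I have extracted a study population. For comparison, I want to select a control group with similar characteristics. The two criteria I want to match on are age and sex. This query gives me the numbers I want to match: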

select sex, age/10 as decades,COUNT(*) as counts
        from
        (
            select distinct m.patid
                ,m.sex,DATEPART(year,min(c.admitdate)) -m.yrdob as Age
                from members as m
                inner join claims as c on c.patid=m.PATID
                group by m.PATID, m.sex,m.yrdob
        )x group by sex, Age/10

The result set contains one row per combination of sex and decade, with the number of patients in each.

The decades column is given by the expression
(DATEPART(year,min(c.admitdate)) -m.yrdob)/10 

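This uses integer division to find people in the age ranges 20-29, 30-39, and so on. For example, I want to select 507 women in their 20s from the larger dataset. The query that finds the characteristics of the larger dataset is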

select distinct m.patid
        ,m.sex
        ,(DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
        from members as m
        inner join claims as c on c.patid=m.PATID
        group by m.PATID, m.sex,m.yrdob
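
The decade bucketing used in both queries relies on T-SQL integer division, which truncates when both operands are integers. A minimal sketch with made-up ages (the inline values and the alias t are illustrative only, not data from the question):

```sql
-- Integer division by 10 maps an age to its decade bucket.
-- The ages below are invented for illustration.
SELECT age,
       age / 10 AS decades   -- 19 -> 1, 20 -> 2, 29 -> 2, 30 -> 3, 41 -> 4
FROM (VALUES (19), (20), (29), (30), (41)) AS t(age);
```

So a decades value of 2 covers ages 20 through 29, which is the bucket the "women in their 20s" example refers to.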

Edit: the second query returns one row per patient, with patid, sex, and decades.

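So I need the sum of the decades column in the second query to equal the counts from the first query. What I tried (it returns zero results) is below. What do I need to do to match these ages?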

The query runs, but returns no results:

select x.PATID--,x.sex,x.decades,y.counts
    from
    (

    select distinct m.patid
        ,m.sex
        ,(DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
        from members as m
        inner join claims as c on c.patid=m.PATID
        group by m.PATID, m.sex,m.yrdob
    ) as x 
    inner join 
    (

        select sex, age/10 as decades,COUNT(*) as counts
        from
        (
            select distinct m.patid
                ,m.sex,DATEPART(year,min(c.admitdate)) -m.yrdob as Age
                from members as m
                inner join claims as c on c.patid=m.PATID
                group by m.PATID, m.sex,m.yrdob
        )x group by sex, Age/10
    ) as y on x.sex=y.sex and x.decades=y.decades
    group by y.counts,x.PATID,x.sex,y.sex
    having SUM(x.decades)=y.counts and x.sex=y.sex
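
One possible reading of why this returns nothing, offered as an interpretation of the SQL rather than something stated in the question: the outer GROUP BY includes x.PATID, so each group holds a single patient, and SUM(x.decades) is just that patient's decade value (for example 2), which is then compared with the stratum size (for example 507). The diagnostic sketch below, which reuses the members and claims tables from the question and introduces the CTE names per_patient and per_stratum purely for illustration, lists the two values side by side:

```sql
-- Diagnostic sketch: show what the HAVING clause ends up comparing.
-- per_patient and per_stratum are illustrative names, not from the question.
WITH per_patient AS (
    SELECT m.patid,
           m.sex,
           (DATEPART(year, MIN(c.admitdate)) - m.yrdob) / 10 AS decades
    FROM members AS m
    INNER JOIN claims AS c ON c.patid = m.patid
    GROUP BY m.patid, m.sex, m.yrdob
),
per_stratum AS (
    SELECT sex, decades, COUNT(*) AS counts
    FROM per_patient
    GROUP BY sex, decades
)
SELECT p.patid,
       p.sex,
       p.decades AS sum_of_decades,  -- one row per patient, so the SUM equals decades
       s.counts                      -- number of patients in the stratum
FROM per_patient AS p
INNER JOIN per_stratum AS s
    ON s.sex = p.sex
   AND s.decades = p.decades;
```

A row would survive the HAVING clause only when a patient's decade number happens to equal the size of that patient's stratum, so an empty result is the expected outcome in most data.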

1 Answer:

Answer 0: (score: 1)

select
   T1.sex,
   T1.decades,
   T1.counts,
   T2.patid

from (

   select 
      sex, 
      age/10 as decades,
      COUNT(*) as counts
   from (

      select  m.patid,
         m.sex,
         DATEPART(year,min(c.admitdate)) -m.yrdob as Age
      from members as m
      inner join claims as c on c.patid=m.PATID
      group by m.PATID, m.sex,m.yrdob
   )x 
   group by sex, Age/10
) as T1
join (
   --right here is where the random sampling occurs
    SELECT TOP 50 --this is the total number of people in our dataset
      patid
      ,sex
      ,decades

   from (
      select  m.patid,
         m.sex,
         (DATEPART(year,min(c.admitdate)) -m.yrdob)/10 as decades
      from members as m
      inner join claims as c on c.patid=m.PATID
      group by m.PATID, m.sex, m.yrdob

   ) T2
      order by NEWID()
) as T2
on T2.sex = T1.sex
and T2.decades = T1.decades 
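
Edit: I posted another question similar to this one, in which I found that my results were not actually random; they were simply the top N results. I had been ordering by newid() in the outermost query, and all that did was shuffle exactly the same result set. From that now-closed question, I found that I need to use the TOP keyword together with order by newid() at the commented line in the query above.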
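
The query above draws one overall random sample of TOP 50 rows. If the goal is to draw a separate random sample for every sex and decade combination, sized to match the study counts, one common alternative is ROW_NUMBER() partitioned by stratum and ordered by NEWID(). This is a sketch under assumptions, not the method from the answer: #study_counts is a hypothetical temp table holding the output of the first query, and the sex coding 'F' with type char(1) is assumed; only the figure of 507 women in their 20s comes from the question.

```sql
-- Hypothetical table of target counts per stratum (output of the first query).
CREATE TABLE #study_counts (sex char(1), decades int, counts int);
INSERT INTO #study_counts (sex, decades, counts)
VALUES ('F', 2, 507);   -- 507 women in their 20s; the 'F' coding is an assumption

WITH candidates AS (
    SELECT p.patid,
           p.sex,
           p.decades,
           -- number the patients in each stratum in random order
           ROW_NUMBER() OVER (PARTITION BY p.sex, p.decades
                              ORDER BY NEWID()) AS rn
    FROM (
        SELECT m.patid,
               m.sex,
               (DATEPART(year, MIN(c.admitdate)) - m.yrdob) / 10 AS decades
        FROM members AS m
        INNER JOIN claims AS c ON c.patid = m.patid
        GROUP BY m.patid, m.sex, m.yrdob
    ) AS p
)
SELECT c.patid, c.sex, c.decades
FROM candidates AS c
INNER JOIN #study_counts AS t
    ON t.sex = c.sex
   AND t.decades = c.decades
WHERE c.rn <= t.counts;   -- keep only as many random rows as the stratum needs
```

Because the random numbering restarts in each partition, every stratum gets its own sample. A stratum with fewer candidates than its target count simply returns all of its candidates.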