Question

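I am describing an abstract case here, but it is similar to the one I am trying to solve right now. I know how to get a rough result with a PL/SQL block, but I would like to know whether anyone can suggest a solution that uses a single select query.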

Suppose we have a table t_people with thousands of records, each describing a person with the following attributes:

id
age, number
height in cm, number
gender, varchar2 ('male' or 'female')

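We need to extract N records so that the result set meets the following conditions:

30% of the selected people are taller than 180 cm
60% of the selected people are male
40% of the selected people are older than 40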

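We can also assume that N is much smaller than the total number of rows in the table and that the problem is solvable.

How would you suggest doing this with a single select query?

Thanks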

Answer 1

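You can divide the data into 8 groups (tall or short, male or not, older or not) and then take a proportional sample from each group to meet your requirements. One crude way is to translate the conditions into group sizes, for example with N = 1000:

300 people taller than 180 cm, not male, not older
100 people shorter, not male, not older
400 people shorter, male, older
200 people shorter, male, not older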
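The four group sizes above can be checked against the three targets with a few lines of arithmetic. The following is a minimal sketch in Python rather than SQL, used only to verify the numbers; the names `groups` and `marginal_shares` are made up for this illustration.

```python
# Each key is (is_tall, is_male, is_older); each value is the number of
# people taken from that group. The sizes are the ones listed above (N = 1000).
groups = {
    (1, 0, 0): 300,  # taller than 180 cm, not male, not older
    (0, 0, 0): 100,  # shorter, not male, not older
    (0, 1, 1): 400,  # shorter, male, older
    (0, 1, 0): 200,  # shorter, male, not older
}


def marginal_shares(groups):
    """Return the share of tall, male and older people in the sample."""
    total = sum(groups.values())
    names = ("tall", "male", "older")
    return {
        name: sum(count for flags, count in groups.items() if flags[i]) / total
        for i, name in enumerate(names)
    }


shares = marginal_shares(groups)
print(shares)
```

The shares come out at 30% tall, 60% male and 40% older, so this particular split satisfies all three conditions. It is only one of many valid splits, since three percentages do not determine eight group sizes uniquely.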

You can then solve the problem like this:

with p as (
      select p.*,
             -- number the rows inside each of the 8 groups; any ordering works,
             -- e.g. order by dbms_random.value for a random sample
             row_number() over (partition by is_tall, is_male, is_older order by id) as seqnum
      from (select p.*,
                   -- flag names must differ from the real columns height and age
                   (case when height > 180 then 1 else 0 end) as is_tall,
                   (case when gender = 'male' then 1 else 0 end) as is_male,
                   (case when age > 40 then 1 else 0 end) as is_older
            from t_people p
           ) p
     )
select p.*
from p
where (is_tall = 1 and is_male = 0 and is_older = 0 and seqnum <= 300) or
      (is_tall = 0 and is_male = 0 and is_older = 0 and seqnum <= 100) or
      (is_tall = 0 and is_male = 1 and is_older = 1 and seqnum <= 400) or
      (is_tall = 0 and is_male = 1 and is_older = 0 and seqnum <= 200);

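An alternative approach is to fill the 8 buckets evenly while keeping track of the running count along each dimension (older/younger, male/female, shorter/taller). When one side of a dimension reaches its target, stop filling the cells on that side and continue with the 4 complementary cells. Repeat this process until you have the desired number of records.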
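The bucket-filling idea can be sketched as a small simulation that only computes how many rows to take from each of the 8 cells. This is a rough illustration in Python, not part of the SQL solution; the function name `fill_buckets`, its integer-percentage parameters, and the round-robin order are assumptions made for this example. It also assumes, as the question does, that every cell has enough rows in the table.

```python
from itertools import product


def fill_buckets(n, tall_pct, male_pct, older_pct):
    """Return {(is_tall, is_male, is_older): count} with counts summing to n."""
    # Capacity for each side of each dimension: side 1 has the property,
    # side 0 lacks it. The two sides of a dimension always add up to n.
    caps = []
    for pct in (tall_pct, male_pct, older_pct):
        yes = n * pct // 100
        caps.append({1: yes, 0: n - yes})

    used = [{0: 0, 1: 0} for _ in caps]
    cells = {cell: 0 for cell in product((0, 1), repeat=3)}
    total = 0

    while total < n:
        progressed = False
        # One round-robin pass: add one person to every cell that is still open.
        for cell in cells:
            # A cell is open only if none of its three sides is full yet.
            if all(used[d][side] < caps[d][side] for d, side in enumerate(cell)):
                cells[cell] += 1
                for d, side in enumerate(cell):
                    used[d][side] += 1
                total += 1
                progressed = True
        if not progressed:  # safety guard against an endless loop
            break
    return cells


cells = fill_buckets(1000, tall_pct=30, male_pct=60, older_pct=40)
for cell, count in cells.items():
    print(cell, count)
```

The resulting counts could then replace the hard-coded `seqnum` limits in a query like the one above. Compared with the hand-picked split, this spreads the sample over more of the 8 groups while still meeting the three targets.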

Answer 2

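I ended up going with the first approach suggested by Gordon Linoff, with a few small modifications. I kept the original idea but introduced several extra subqueries that specify the desired distribution of records within the groups and build a matrix holding the required record count for each group. There is also a global parameters section, which contains a single parameter: the total number of records.

The query produces quite useful results: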

with 
    people as (
        select  id,
                floor(months_between(sysdate, date_birth)/12) age,
                195 - least(floor(months_between(sysdate, date_birth)/12), 50) height,
                decode(sex, 1, 'male', 'female') gender
        from    my_people_table
        where   date_birth is not null and rownum < 100000
    ),
    params as ( /* Global params */
        select  100 rec_count   -- total record count 
        from dual
    ),
    age_groups as (     /* distribution by age */
        select  'group 1' age_group, .7 prc from dual union
        select  'group 2' age_group, .3 prc from dual  
    ),
    height_groups as ( /* distribution by height */
        select  'group 1' height_group, .6 prc from dual union
        select  'group 2' height_group, .4 prc from dual  
    ),
    genders as (       /* distribution by gender */
        select  'male'   gender, .6 prc from dual union
        select  'female' gender, .4 prc from dual  
    ),
    mx as (            /* a matrix with record counts per group */
        select  age_group, height_group, gender,
                ceil(
                    age_groups.prc * 
                    height_groups.prc * 
                    genders.prc * 
                    rec_count
                )  rec_count       
        from    age_groups, height_groups, genders, params
    ),
    xpeople as (       /* Minor transformations - groups and group counters */
        select  p.*, 
                row_number() over (
                    partition by age_group, height_group, gender
                        order by age_group, height_group, gender
                ) rec_num
        from (                             
                select  people.*,
                        case 
                            when age    <=  40 then 'group 1' 
                                               else 'group 2' 
                        end age_group,
                        case 
                            when height <= 180 then 'group 1' 
                                               else 'group 2' 
                        end height_group
                from    people
        ) p
    )
/* the resulting query uses the matrix to filter the records */    
select  xpeople.*
from    xpeople join mx 
            on  xpeople.age_group = mx.age_group 
            and xpeople.height_group = mx.height_group      
            and xpeople.gender = mx.gender
            and xpeople.rec_num <= mx.rec_count
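One detail of the `mx` matrix is worth checking: because each cell count is rounded up with `ceil`, the counts can add up to slightly more than `rec_count`. The sketch below redoes the matrix arithmetic in Python with the same percentages as the query. The function names are made up for this illustration, and the largest-remainder rounding is a swapped-in alternative, not something the query above does.

```python
from fractions import Fraction
from itertools import product
from math import ceil, floor

# Same shares as the age_groups, height_groups and genders subqueries.
AGE = {"group 1": Fraction(7, 10), "group 2": Fraction(3, 10)}
HEIGHT = {"group 1": Fraction(6, 10), "group 2": Fraction(4, 10)}
GENDER = {"male": Fraction(6, 10), "female": Fraction(4, 10)}


def exact_counts(rec_count):
    """Unrounded count per (age_group, height_group, gender) cell."""
    return {
        (a, h, g): AGE[a] * HEIGHT[h] * GENDER[g] * rec_count
        for a, h, g in product(AGE, HEIGHT, GENDER)
    }


def ceil_counts(rec_count):
    """Counts rounded up per cell, as the mx subquery does."""
    return {cell: ceil(x) for cell, x in exact_counts(rec_count).items()}


def largest_remainder_counts(rec_count):
    """Alternative rounding: floor every cell, then hand the leftover rows
    to the cells with the largest fractional parts."""
    exact = exact_counts(rec_count)
    counts = {cell: floor(x) for cell, x in exact.items()}
    leftover = rec_count - sum(counts.values())
    by_fraction = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cell in by_fraction[:leftover]:
        counts[cell] += 1
    return counts


print(sum(ceil_counts(100).values()))               # 104
print(sum(largest_remainder_counts(100).values()))  # 100
```

With `rec_count` set to 100 the query can therefore return up to 104 rows. If an exact total matters, one option would be to compute the matrix with a rounding scheme like the second function, at the cost of a slightly more involved `mx` subquery.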

Thanks for your help!

Extract a specific number of records that meet certain overall conditions
