Computing the median of data already aggregated/grouped by measure

Time: 2014-04-03 03:08:50

Tags: sql oracle statistics

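Is there a direct way to compute the median of data that has already been aggregated by measure? In other words, I have a table in which measurements belong to groups, and a count is recorded for each measurement.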

CREATE TABLE MEASUREMENTS AS 
SELECT 'RED'  COLOR, 4 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED'  COLOR, 5 MEASUREMENT, 3 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED'  COLOR, 6 MEASUREMENT, 1 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 5 MEASUREMENT, 4 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 6 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL ;

╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED   ║           4 ║             5 ║
║ RED   ║           5 ║             3 ║
║ RED   ║           6 ║             1 ║
║ BLUE  ║           5 ║             4 ║
║ BLUE  ║           6 ║             5 ║
╚═══════╩═════════════╩═══════════════╝

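The "natural" solution is to explode the measurement counts into individual rows with values and then group using Oracle's built-in MEDIAN. The math would look like this: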

RED=>(4,4,4,4,4,5,5,5,6), median = 4
BLUE=>(5,5,5,5,6,6,6,6,6), median = 6
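
For reference, the expansion approach can be sketched outside the database. This is a minimal Python illustration, not part of the original post; the `rows` list simply mirrors the MEASUREMENTS table, and `expanded_median` is a hypothetical helper name.

```python
from statistics import median

# (color, measurement, measure_count) rows mirroring the MEASUREMENTS table
rows = [
    ("RED", 4, 5), ("RED", 5, 3), ("RED", 6, 1),
    ("BLUE", 5, 4), ("BLUE", 6, 5),
]

def expanded_median(rows):
    """Expand every count into individual values, then take the median per color."""
    values = {}
    for color, measurement, count in rows:
        values.setdefault(color, []).extend([measurement] * count)
    return {color: median(vals) for color, vals in values.items()}

result = expanded_median(rows)
print(result)
```

The memory cost grows with the sum of the counts rather than with the number of rows, which is the problem described next.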

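However, (1) I am dealing with millions of rows, which would explode into many millions of individual measurements, and (2) it feels like I would be "undoing and redoing" the mathematically expensive work of the median.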

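Because I want to put a view definition on top of this, and embedding analytic functions in a view tends to cripple execution plans, I would like to avoid something like this: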

    SELECT  COLOR,
            MIN(MEASUREMENT) MEDIAN_MEASUREMENT
    FROM 
      (        
        SELECT  COLOR, 
                MEASUREMENT, 
                SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR ORDER BY MEASUREMENT)  / 
                    SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR) PCT
        FROM    MEASUREMENTS            
      )
    WHERE PCT >=.5  
    GROUP BY COLOR               

If it is mathematically feasible, I would prefer to do this with a straight GROUP BY (AVG shown as an example):

SELECT  COLOR, 
        SUM(MEASUREMENT * MEASURE_COUNT) / SUM(MEASURE_COUNT) AVG_MEASUREMENT
        -- MEDIAN LOGIC (???) HERE  
FROM    MEASUREMENTS
GROUP BY COLOR
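
As a sanity check on the count-weighted average, here is a small Python sketch (not part of the original post; `weighted_mean` is a hypothetical helper). Each measurement is multiplied by its count before summing, and the total is divided by the number of underlying measurements.

```python
# (color, measurement, measure_count) rows mirroring the MEASUREMENTS table
rows = [
    ("RED", 4, 5), ("RED", 5, 3), ("RED", 6, 1),
    ("BLUE", 5, 4), ("BLUE", 6, 5),
]

def weighted_mean(rows):
    """Average per color, weighting each measurement by its count."""
    totals = {}
    for color, measurement, count in rows:
        weighted_sum, n = totals.get(color, (0, 0))
        totals[color] = (weighted_sum + measurement * count, n + count)
    return {color: s / n for color, (s, n) in totals.items()}

means = weighted_mean(rows)
print(means)
```

For RED this is (4*5 + 5*3 + 6*1) / 9 = 41/9, the same value you would get by averaging the nine expanded measurements.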

2 answers:

Answer 0 (score: 3)

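If I understand correctly, I can see a fairly straightforward way to do this, and I think I can describe it clearly. I am fairly sure I cannot express it in SQL today, but I will keep this tab open in my browser, and if nobody else contributes I will try it tomorrow.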

╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED   ║           4 ║             5 ║
║ RED   ║           5 ║             3 ║
║ RED   ║           6 ║             1 ║
║ BLUE  ║           5 ║             4 ║
║ BLUE  ║           6 ║             5 ║
╚═══════╩═════════════╩═══════════════╝

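First, work out which measurement represents the median. You can do this from the counts alone. For red, for example, there are nine measurements in total, so the median is the 5th measurement. The SQL for this should be simple.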

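Second, I think you can use analytic functions to determine which row contains the median measurement. For red, you determine which row the 5th measurement falls in; it is in the first row. This is a bit like a "running balance" problem. The value of the "measurement" column in that row is the value you are after.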
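
The two steps above can be sketched in Python before turning to SQL. This is only an illustration under stated assumptions: `measurement_at` is a hypothetical helper, the input is the (measurement, measure_count) pairs for a single color, and only the odd-total case is shown here.

```python
from bisect import bisect_left
from itertools import accumulate

def measurement_at(pairs, position):
    """Return the measurement at 1-based `position` in the implied sorted list.

    pairs: (measurement, measure_count) tuples for one color.
    """
    pairs = sorted(pairs)
    # Running total of counts, ordered by measurement ("running balance")
    running = list(accumulate(count for _, count in pairs))
    if not 1 <= position <= running[-1]:
        raise ValueError("position out of range")
    # First row whose running total reaches the requested position
    return pairs[bisect_left(running, position)][0]

red = [(4, 5), (5, 3), (6, 1)]
total = sum(count for _, count in red)   # 9 measurements
median_position = (total + 1) // 2       # the 5th measurement
red_median = measurement_at(red, median_position)
print(red_median)
```

No row is ever expanded: the lookup only needs the running totals 5, 8, 9 to see that the 5th measurement sits in the first row.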

Wall of code (in standard SQL, I think)

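"Unrolling" an aggregation is expensive, so this might not be useful to you. I rely on common table expressions to lighten the load on my brain.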

with measurements as (
  select 'red'   color, 4 measurement, 5 measure_count union all
  select 'red'   color, 5 measurement, 3 measure_count union all
  select 'red'   color, 6 measurement, 1 measure_count union all
  select 'blue'  color, 5 measurement, 4 measure_count union all
  select 'blue'  color, 6 measurement, 5 measure_count union all
  -- Added green, even number of measurements, median should be 5.5.
  select 'green' color, 5 measurement, 4 measure_count union all
  select 'green' color, 6 measurement, 4 measure_count union all
  -- Added yellow, extreme differences in measurements, median should be 6.
  select 'yellow' color, 6 measurement, 2 measure_count union all
  select 'yellow' color, 100 measurement, 1 measure_count 
)
, measurement_starts as (
  select 
    *,
    sum(measure_count) over (partition by color order by measurement) total_rows_so_far
  from measurements
)
, extended_measurements as (
  select 
    color, measurement, measure_count,
    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + 1 measure_start_row,
    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + measure_count measure_end_row 
  from measurement_starts
)
, median_row_range as (
  select color, 
    sum(measure_count) num_measurements, 
    ceiling(sum(measure_count)/2.0) start_measurement, 
    case 
      sum(measure_count) % 2 = 0
      when true then ceiling(sum(measure_count)/2.0)+1
      else ceiling(sum(measure_count)/2.0)
    end
    end_measurement
  from measurements
  group by color
)
, median_row_values as (
  select m.color, c.measurement
  from median_row_range m
  inner join extended_measurements c 
          on c.color = m.color 
         and m.start_measurement between c.measure_start_row and c.measure_end_row
  union all
  select m.color, c.measurement
  from median_row_range m
  inner join extended_measurements c 
          on c.color = m.color 
         and m.end_measurement between c.measure_start_row and c.measure_end_row
)
select color, avg(measurement)
from median_row_values
group by color
order by color;

blue    6.00
green   5.50
red     4.00
yellow  6.00

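The CTE "extended_measurements" extends the measurements table to include the starting "row" number and the ending "row" number that you would find with the unaggregated data.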

color  measurement  measure_count  measure_start_row  measure_end_row
--
blue   5            4              1                  4
blue   6            5              5                  9
green  5            4              1                  4
green  6            4              5                  8
red    4            5              1                  5
red    5            3              6                  8
red    6            1              9                  9
yellow 6            2              1                  2
yellow 100          1              3                  3

The CTE "median_row_range" determines the starting "row" and the ending "row" for the median.

color  num_measurements  start_measurement  end_measurement
--
blue   9                 5                  5
green  8                 4                  5
red    9                 5                  5
yellow 3                 2                  2

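This means the median for 'blue' can be computed as the average of the 5th "row" and the 5th "row". That is, the median for 'blue' is just the 5th value. The median for 'green' is the average of the 4th "row" and the 5th "row".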
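
The range-and-average step can be mirrored in a few lines of Python. This is a sketch, not the answer's code: `median_row_range` and `median_from_intervals` are hypothetical names, and the interval tuples are copied from the "extended_measurements" output above.

```python
import math

def median_row_range(total):
    """Start and end 'row' positions of the median for `total` measurements."""
    start = math.ceil(total / 2)
    end = start + 1 if total % 2 == 0 else start
    return start, end

def median_from_intervals(intervals):
    """intervals: (measurement, start_row, end_row) tuples for one color."""
    total = max(end for _, _, end in intervals)
    picks = []
    for position in median_row_range(total):
        for measurement, start, end in intervals:
            if start <= position <= end:
                picks.append(measurement)
    # Odd totals pick the same value twice, so the average is that value
    return sum(picks) / len(picks)

blue = [(5, 1, 4), (6, 5, 9)]
green = [(5, 1, 4), (6, 5, 8)]
yellow = [(6, 1, 2), (100, 3, 3)]
medians = {name: median_from_intervals(rows)
           for name, rows in [("blue", blue), ("green", green), ("yellow", yellow)]}
print(medians)
```

Picking the same position twice for odd totals is what lets the SQL use a plain `avg` over the `union all` without special-casing odd and even counts.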

Answer 1 (score: 1)

The idea behind this answer is the same as Mike's, but the implementation differs.

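  • The first CTE, extended_measurements, finds the cumulative count and the midpoint for each color. If the total count is even, the median is the average of the two middle values, so floor and ceil give you those two positions.
  • The second CTE, extended_measurements2, finds the measurement corresponding to each midpoint by comparing the midpoint with the cumulative sum. This is done for both floor_midpoint and ceil_midpoint. A rank is assigned because we are only interested in the first matching record.
  • The final query selects only the measurements with the lowest rank and averages them, which gives the MEDIAN value.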
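
The three steps listed above can be checked with a short Python sketch. It is an illustration only: `midpoint_median` and `first_reaching` are hypothetical names, and the input is the (measurement, measure_count) pairs for a single color.

```python
import math
from itertools import accumulate

def midpoint_median(pairs):
    """Median of one color from (measurement, measure_count) pairs."""
    pairs = sorted(pairs)
    cumulative = list(accumulate(count for _, count in pairs))
    total = cumulative[-1]
    # Both midpoints coincide when the total is odd
    floor_mid = math.floor((total + 1) * 0.5)
    ceil_mid = math.ceil((total + 1) * 0.5)

    def first_reaching(position):
        # First measurement whose cumulative count reaches the position
        return next(m for (m, _), c in zip(pairs, cumulative) if position <= c)

    return 0.5 * (first_reaching(floor_mid) + first_reaching(ceil_mid))

green_median = midpoint_median([(5, 4), (6, 4)])
red_median = midpoint_median([(4, 5), (5, 3), (6, 1)])
print(green_median, red_median)
```

For green the total is 8, so the midpoints are positions 4 and 5, which fall on measurements 5 and 6 and average to 5.5.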

SQL Fiddle

Query:

--get the midpoint and cumulative sum of measure_count
with extended_measurements as(
    select color, measurement,
           floor((sum(measure_count) over 
                    (partition by color) + 1) * 0.5)            floor_midpoint,
           ceil((sum(measure_count) over 
                    (partition by color) + 1) * 0.5)            ceil_midpoint,
           sum(measure_count) over 
                    (partition by color order by measurement)   cumltv_sum
    from measurements
),
--assign rank to the measure_count where median lies
extended_measurements2 as(
    select color, measurement,
           case when floor_midpoint <= cumltv_sum
                    then row_number() over (partition by color order by measurement)
                else null
           end r1,
           case when ceil_midpoint <= cumltv_sum
                    then row_number() over (partition by color order by measurement)
                else null
           end r2
    from extended_measurements
)
--get the average of measurements that have least rank
select color, 0.5 * (
                        max(measurement) keep (dense_rank first order by r1) + 
                        max(measurement) keep (dense_rank first order by r2)
                     )  median
from extended_measurements2
group by color
order by color

Result:

|  COLOR | MEDIAN |
|--------|--------|
|   blue |      6 |
|  green |    5.5 |
|    red |      4 |
|  white |      8 |
| yellow |      6 |

Another fiddle verifies the results against both the unaggregated and the aggregated data.