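Is there a direct way to calculate the median of data that has already been summarized by a metric? In other words, I have a table in which measurements belong to a group, and a count is recorded for each measurement.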
CREATE TABLE MEASUREMENTS AS
SELECT 'RED' COLOR, 4 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED' COLOR, 5 MEASUREMENT, 3 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'RED' COLOR, 6 MEASUREMENT, 1 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 5 MEASUREMENT, 4 MEASURE_COUNT FROM DUAL UNION ALL
SELECT 'BLUE' COLOR, 6 MEASUREMENT, 5 MEASURE_COUNT FROM DUAL ;
╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED ║ 4 ║ 5 ║
║ RED ║ 5 ║ 3 ║
║ RED ║ 6 ║ 1 ║
║ BLUE ║ 5 ║ 4 ║
║ BLUE ║ 6 ║ 5 ║
╚═══════╩═════════════╩═══════════════╝
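The "natural" solution is to explode the measurement counts into individual rows, one per value, and then use the Oracle-provided MEDIAN with a GROUP BY. The math would look like this: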
RED=>(4,4,4,4,4,5,5,5,6), median = 4
BLUE=>(5,5,5,5,6,6,6,6,6), median = 6
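But (1) I am dealing with millions of rows that would explode into many millions of individual measurements, and (2) it feels like I'm "undoing and redoing" the mathematically expensive median work.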
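For reference, the explode-then-median baseline described above can be sketched outside the database. This is only an illustration in Python, not Oracle code: the `MEASUREMENTS` list and the `exploded_medians` helper are hypothetical names, and `statistics.median` stands in for Oracle's MEDIAN.

```python
from statistics import median

# (color, measurement, measure_count) rows copied from the sample table.
MEASUREMENTS = [
    ("RED", 4, 5), ("RED", 5, 3), ("RED", 6, 1),
    ("BLUE", 5, 4), ("BLUE", 6, 5),
]

def exploded_medians(rows):
    """Expand every count into individual values, then take the median per color."""
    expanded = {}
    for color, measurement, count in rows:
        # One list entry per individual measurement: this is the costly step.
        expanded.setdefault(color, []).extend([measurement] * count)
    return {color: median(values) for color, values in expanded.items()}

print(exploded_medians(MEASUREMENTS))
```

The memory used grows with the sum of the counts rather than with the number of summarized rows, which is the cost the question wants to avoid.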
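Because I want to put this into a view definition, and embedding analytic functions in a view tends to cripple execution plans, I'd like to avoid something like this: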
SELECT COLOR,
       MIN(MEASUREMENT) MEDIAN_MEASUREMENT
FROM
(
  SELECT COLOR,
         MEASUREMENT,
         -- Cumulative share of measurements, accumulated in MEASUREMENT order
         SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR ORDER BY MEASUREMENT) /
         SUM(MEASURE_COUNT) OVER (PARTITION BY COLOR) PCT
  FROM MEASUREMENTS
)
WHERE PCT >= .5
GROUP BY COLOR
If it is mathematically feasible, I would prefer to get this done with a straight GROUP BY (shown here with AVG as the example):
SELECT COLOR,
       -- Weighted average: each measurement counts MEASURE_COUNT times
       SUM(MEASUREMENT * MEASURE_COUNT) / SUM(MEASURE_COUNT) AVG_MEASUREMENT
       -- MEDIAN LOGIC (???) HERE
FROM MEASUREMENTS
GROUP BY COLOR
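Answer 0: (score: 3)
If I understand you correctly, I can see a fairly straightforward way to do this, and I think I can describe it clearly. I'm pretty sure I can't express it in SQL today, but I'll leave this tab open in my browser, and if nobody else contributes I'll take a crack at it tomorrow.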
╔═══════╦═════════════╦═══════════════╗
║ COLOR ║ MEASUREMENT ║ MEASURE_COUNT ║
╠═══════╬═════════════╬═══════════════╣
║ RED ║ 4 ║ 5 ║
║ RED ║ 5 ║ 3 ║
║ RED ║ 6 ║ 1 ║
║ BLUE ║ 5 ║ 4 ║
║ BLUE ║ 6 ║ 5 ║
╚═══════╩═════════════╩═══════════════╝
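First, work out which measurement represents the median. You can do that from the counts alone. For red, for example, there are nine measurements in total, so the median is the 5th measurement. The SQL for this should be simple.
Second, I think you can use analytic functions to determine which row holds the median measurement. For red, you need the row that contains the 5th measurement; it's the first row. This is a bit like a "running balance" problem. The value of the "measurement" column in that row is the value you're after.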
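The two steps above can be sketched in a few lines. This is a minimal Python illustration rather than SQL, under the assumption that the total count is odd, as it is for red and blue; `median_row` is a hypothetical helper name.

```python
def median_row(rows):
    """rows: (measurement, measure_count) pairs for a single color.

    Assumes an odd total count, so a single measurement is the median.
    """
    total = sum(count for _, count in rows)
    target = (total + 1) // 2          # step 1: position of the median measurement

    running = 0                        # step 2: the "running balance"
    for measurement, count in sorted(rows):
        running += count
        if running >= target:          # this row contains the target position
            return measurement

print(median_row([(4, 5), (5, 3), (6, 1)]))   # red
print(median_row([(5, 4), (6, 5)]))           # blue
```

For an even total this would return only the lower of the two middle values, so that case needs the extra handling shown in the SQL.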
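Wall of code (in standard SQL, I think)
"Unwinding" the aggregate is expensive, so this might not be useful to you. I relied on common table expressions to reduce the load on my brain.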
-- Dialect note: as written this uses PostgreSQL-style syntax (SELECT without
-- FROM, % for modulo, ceiling()); Oracle would need MOD() and CEIL() instead.
with measurements as (
  select 'red' color, 4 measurement, 5 measure_count union all
  select 'red' color, 5 measurement, 3 measure_count union all
  select 'red' color, 6 measurement, 1 measure_count union all
  select 'blue' color, 5 measurement, 4 measure_count union all
  select 'blue' color, 6 measurement, 5 measure_count union all
  -- Added green, even number of measurements, median should be 5.5.
  select 'green' color, 5 measurement, 4 measure_count union all
  select 'green' color, 6 measurement, 4 measure_count union all
  -- Added yellow, extreme differences in measurements, median should be 6.
  select 'yellow' color, 6 measurement, 2 measure_count union all
  select 'yellow' color, 100 measurement, 1 measure_count
)
-- Running total of measure_count per color, in measurement order.
, measurement_starts as (
  select
    *,
    sum(measure_count) over (partition by color order by measurement) total_rows_so_far
  from measurements
)
-- First and last unaggregated "row" number covered by each summarized row.
, extended_measurements as (
  select
    color, measurement, measure_count,
    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + 1 measure_start_row,
    coalesce(lag(total_rows_so_far) over (partition by color order by measurement), 0) + measure_count measure_end_row
  from measurement_starts
)
-- "Row" numbers of the median: one row for odd totals, two for even totals.
, median_row_range as (
  select color,
    sum(measure_count) num_measurements,
    ceiling(sum(measure_count)/2.0) start_measurement,
    case
      when sum(measure_count) % 2 = 0 then ceiling(sum(measure_count)/2.0) + 1
      else ceiling(sum(measure_count)/2.0)
    end end_measurement
  from measurements
  group by color
)
-- Measurements found at the start and end "rows" of the median range.
, median_row_values as (
  select m.color, c.measurement
  from median_row_range m
  inner join extended_measurements c
    on c.color = m.color
    and m.start_measurement between c.measure_start_row and c.measure_end_row
  union all
  select m.color, c.measurement
  from median_row_range m
  inner join extended_measurements c
    on c.color = m.color
    and m.end_measurement between c.measure_start_row and c.measure_end_row
)
select color, avg(measurement)
from median_row_values
group by color
order by color;
blue 6.00
green 5.50
red 4.00
yellow 6.00
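The CTE "extended_measurements" expands the measurements table to include the starting "row" number and ending "row" number that you would find with the unaggregated data.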
color   measurement  measure_count  measure_start_row  measure_end_row
--
blue    5            4              1                  4
blue    6            5              5                  9
green   5            4              1                  4
green   6            4              5                  8
red     4            5              1                  5
red     5            3              6                  8
red     6            1              9                  9
yellow  6            2              1                  2
yellow  100          1              3                  3
The CTE "median_row_range" determines the starting "row" and ending "row" for the median.
color   num_measurements  start_measurement  end_measurement
--
blue    9                 5                  5
green   8                 4                  5
red     9                 5                  5
yellow  3                 2                  2
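This means the median for 'blue' can be calculated as the average of the 5th "row" and the 5th "row". That is, the median for 'blue' is simply the 5th value. The median for 'green' is the average of the 4th "row" and the 5th "row".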
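To make the row arithmetic easier to check by hand, here is a rough Python mirror of the CTEs. It is a sketch for verification only, not a replacement for the SQL; `row_ranges` and `weighted_median` are hypothetical names that loosely correspond to extended_measurements and to median_row_range plus median_row_values.

```python
from math import ceil

def row_ranges(rows):
    """Yield (measurement, start_row, end_row), like extended_measurements."""
    end = 0
    for measurement, count in sorted(rows):
        start, end = end + 1, end + count
        yield measurement, start, end

def weighted_median(rows):
    """rows: (measurement, measure_count) pairs for a single color."""
    total = sum(count for _, count in rows)
    first = ceil(total / 2)                          # start_measurement
    last = first + 1 if total % 2 == 0 else first    # end_measurement
    picks = [
        measurement
        for position in (first, last)
        for measurement, start, end in row_ranges(rows)
        if start <= position <= end
    ]
    return sum(picks) / len(picks)                   # avg(measurement)

for color, rows in {
    "green": [(5, 4), (6, 4)],
    "yellow": [(6, 2), (100, 1)],
}.items():
    print(color, weighted_median(rows))
```

Because only the summarized rows are scanned, the work depends on the number of distinct measurements per color, not on the total number of individual measurements.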
Answer 1: (score: 1)
The idea behind this answer is the same as Mike's, but the execution differs.
**Query**:
--get the midpoints and the cumulative sum of measure_count
with extended_measurements as (
  select color, measurement,
         floor((sum(measure_count) over
               (partition by color) + 1) * 0.5) floor_midpoint,
         ceil((sum(measure_count) over
              (partition by color) + 1) * 0.5) ceil_midpoint,
         sum(measure_count) over
             (partition by color order by measurement) cumltv_sum
  from measurements
),
--assign a rank to the rows at or beyond each midpoint, null otherwise
extended_measurements2 as (
  select color, measurement,
         case when floor_midpoint <= cumltv_sum
              then row_number() over (partition by color order by measurement)
              else null
         end r1,
         case when ceil_midpoint <= cumltv_sum
              then row_number() over (partition by color order by measurement)
              else null
         end r2
  from extended_measurements
)
--get the average of the measurements that have the lowest rank
select color, 0.5 * (
         max(measurement) keep (dense_rank first order by r1) +
         max(measurement) keep (dense_rank first order by r2)
       ) median
from extended_measurements2
group by color
order by color
**Result**:
| COLOR | MEDIAN |
|--------|--------|
| blue | 6 |
| green | 5.5 |
| red | 4 |
| white | 8 |
| yellow | 6 |
Another fiddle verifies the results against both the unaggregated and the aggregated data.
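The midpoint variant used in this answer can also be checked with a short Python sketch. It assumes the same idea as the query (take the first measurement whose cumulative count reaches each midpoint, then average the two); `midpoint_median` and `first_reaching` are hypothetical names, not part of the answer.

```python
from itertools import accumulate
from math import ceil, floor

def midpoint_median(rows):
    """rows: (measurement, measure_count) pairs for a single color."""
    rows = sorted(rows)
    total = sum(count for _, count in rows)
    # floor_midpoint and ceil_midpoint: equal for odd totals, adjacent for even.
    lo, hi = floor((total + 1) / 2), ceil((total + 1) / 2)
    cumulative = list(accumulate(count for _, count in rows))  # cumltv_sum

    def first_reaching(position):
        # First measurement whose cumulative count covers the position.
        return next(m for (m, _), c in zip(rows, cumulative) if c >= position)

    return 0.5 * (first_reaching(lo) + first_reaching(hi))

print(midpoint_median([(5, 4), (6, 4)]))   # green, even total
```

Using `(total + 1) / 2` with floor and ceil avoids a separate odd/even branch: both midpoints collapse to the same position when the total is odd.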