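I am trying to calculate the median time people spend on a particular category. The full dataset has about 500k rows, but I have tried to summarize a snippet of it below.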
person  category      time spent (in mins)
roger   dota          20
jim     dota          50
joe     call of duty  5
jim     fallout       25
kathy   GTA           40
alicia  fallout       100
I tried the query below, but I am not getting anywhere.
SELECT x1.person, x1.time spent
from data x1, data x2
GROUP BY x1.val
HAVING SUM(SIGN(1-SIGN(x2.val-x1.val))) = (COUNT(*)+1)/2
Answer 0 (score: 1)
A self-join on 500,000 rows is likely to be expensive. Why not just enumerate the rows and grab the one in the middle?
select d.*
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
where 2*rn in (@rn, @rn + 1);
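The odd-looking where clause selects the value in the middle. With an even number of rows this is only an approximation, because it returns the lower of the two middle rows. The approximation is needed because you want actual row values. The normal calculation of the median itself would be: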
select avg(d.val)
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
where 2*rn in (@rn, @rn + 1, @rn + 2);
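To see which rows a filter of the form 2*rn in (...) keeps, the arithmetic can be checked outside the database. The sketch below is plain Python rather than MySQL, and the helper name kept_rows is hypothetical. It assumes n is the total row count (the final value of @rn) and compares a single-row filter (n, n + 1) with a three-value filter (n, n + 1, n + 2), which keeps one middle row when n is odd and both middle rows when n is even.

```python
def kept_rows(n, offsets):
    """Return the 1-based row numbers rn in 1..n with 2*rn in {n + k for k in offsets}."""
    targets = {n + k for k in offsets}
    return [rn for rn in range(1, n + 1) if 2 * rn in targets]

# Single-row filter: 2*rn in (n, n + 1)
single_odd = kept_rows(5, (0, 1))    # the middle row of 5
single_even = kept_rows(6, (0, 1))   # the lower middle row of 6

# Averaging filter: 2*rn in (n, n + 1, n + 2)
avg_odd = kept_rows(5, (0, 1, 2))    # one middle row
avg_even = kept_rows(6, (0, 1, 2))   # both middle rows

print(single_odd, single_even, avg_odd, avg_even)
# prints: [3] [3] [3] [3, 4]
```

Averaging the values on the rows kept by the three-value filter gives the usual median for both odd and even row counts.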
EDIT:
The same logic works per person, but it takes a bit more logic to get the total count for each group:
select d.person, avg(d.val) as median
from (select d.*,
             (@rn := if(@p = person, @rn + 1,
                        if(@p := person, 1, 1)
                       )
             ) as rn
      from data d cross join
           (select @rn := 0, @p := '') params
      order by person, d.val
     ) d join
     (select person, count(*) as cnt
      from data
      group by person
     ) p
     on d.person = p.person
where 2*rn in (p.cnt, p.cnt + 1, p.cnt + 2)
group by d.person;
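As a cross-check on small inputs, the per-group calculation can be sketched in plain Python using the sample rows from the question. This is only an illustration under stated assumptions, not the answer's SQL: the function name group_medians and its key_index parameter are hypothetical, and the filter used is 2*rn in (cnt, cnt + 1, cnt + 2), where cnt is the number of rows in the group. Passing a different key_index groups by category instead of person.

```python
from collections import defaultdict

# Sample rows from the question: (person, category, minutes)
rows = [
    ("roger", "dota", 20),
    ("jim", "dota", 50),
    ("joe", "call of duty", 5),
    ("jim", "fallout", 25),
    ("kathy", "GTA", 40),
    ("alicia", "fallout", 100),
]

def group_medians(rows, key_index):
    """Median of minutes per group, picking the middle rows by row number."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    result = {}
    for key, values in groups.items():
        values.sort()
        cnt = len(values)
        # Keep rows whose doubled 1-based position is cnt, cnt + 1 or cnt + 2.
        middle = [v for rn, v in enumerate(values, start=1)
                  if 2 * rn in (cnt, cnt + 1, cnt + 2)]
        result[key] = sum(middle) / len(middle)
    return result

by_person = group_medians(rows, 0)
by_category = group_medians(rows, 1)
print(by_person["jim"])        # 37.5
print(by_category["fallout"])  # 62.5
```

Groups with a single row return that row's value, and groups with two rows return the average of both.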