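Calculating the median of data in MySQL

Date: 2015-11-17 23:13:30

Tags: mysql sql

I am trying to calculate the median time people spend on a particular category. The full data set has about 500k rows, but I have tried to summarize a fragment of it below.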

person  category      time spent (in mins)
roger   dota          20
jim     dota          50
joe     call of duty  5
jim     fallout       25
kathy   GTA           40
alicia  fallout       100

I tried the query below, but I am not getting anywhere.

SELECT x1.person, x1.time spent 
from data x1, data x2
GROUP BY x1.val
HAVING SUM(SIGN(1-SIGN(x2.val-x1.val))) = (COUNT(*)+1)/2

1 answer:

Answer 0 (score: 1)

A self-join on 500,000 rows can be expensive. Why not just enumerate the rows and grab the one in the middle?

select d.*
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
where 2*rn in (@rn, @rn + 1);

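The odd-looking where clause picks the middle value. If there is an even number of rows, it returns the lower of the two middle rows, so it is only an approximation. The approximation is needed because you want the actual row values. The normal calculation of the median itself would be: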

select avg(d.val)
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
where 2*rn in (@rn, @rn + 1, @rn + 2);
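
On MySQL 8.0 or later, the same "number the rows and keep the middle one or two" idea can also be sketched with window functions instead of user variables. This is only an illustration, not part of the original answer: the table definition and the column names `category` and `val` are assumptions modeled on the sample rows in the question.

```sql
-- Hypothetical table mirroring the sample rows in the question.
create table data (
    person   varchar(50),
    category varchar(50),
    val      int            -- time spent, in minutes (assumed column name)
);

insert into data (person, category, val) values
    ('roger',  'dota',         20),
    ('jim',    'dota',         50),
    ('joe',    'call of duty',  5),
    ('jim',    'fallout',      25),
    ('kathy',  'GTA',          40),
    ('alicia', 'fallout',     100);

-- Number the rows by val, attach the total row count to every row,
-- then average the middle row (odd count) or the two middle rows (even count).
select avg(val) as median
from (select val,
             row_number() over (order by val) as rn,
             count(*)     over ()             as cnt
      from data
     ) d
where 2*rn in (cnt, cnt + 1, cnt + 2);
```

With the six sample rows, the sorted values are 5, 20, 25, 40, 50, 100, so the filter keeps rows 3 and 4 and the result is the average of 25 and 40, that is 32.5.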

EDIT:

The same logic works per person, but it takes a bit more logic to get the row count for each person:

select d.person, avg(d.val) as median
from (select d.*,
             -- restart the counter at 1 whenever the person changes
             (@rn := if(@p = person, @rn + 1,
                        if(@p := person, 1, 1)
                       )
             ) as rn
      from data d cross join
           (select @rn := 0, @p := '') params
      order by person, d.val
     ) d join
     (select person, count(*) as cnt
      from data
      group by person
     ) p
     on d.person = p.person
-- keep the middle row (odd count) or the two middle rows (even count)
where 2*rn in (p.cnt, p.cnt + 1, p.cnt + 2)
group by d.person;
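
Since the question asks about the median per category rather than per person, here is a hedged sketch of the same approach grouped by category. It assumes MySQL 8.0 or later (for window functions) and that the table has columns named `category` and `val`; neither assumption comes from the original answer.

```sql
-- Number the rows within each category and attach the per-category count,
-- then average the middle row or the two middle rows of each category.
select category, avg(val) as median
from (select category,
             val,
             row_number() over (partition by category order by val) as rn,
             count(*)     over (partition by category)              as cnt
      from data
     ) d
where 2*rn in (cnt, cnt + 1, cnt + 2)
group by category;
```

On the sample rows from the question this would give 35 for dota (average of 20 and 50), 62.5 for fallout (average of 25 and 100), and the single values 5 and 40 for call of duty and GTA.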