SQL ranking query to compute ranks and median in subgroups

Time: 2013-04-11 10:35:12

Tags: sql sqlite group-by ranking median

I want to compute the median of y in subgroups of this simple xy_table:

  x | y   --groups-->   gid |   x | y   --medians-->   gid |   x | y
-------                 -------------                  -------------
0.1 | 4                 0.0 | 0.1 | 4                  0.0 | 0.1 | 4
0.2 | 3                 0.0 | 0.2 | 3                      |     |
0.7 | 5                 1.0 | 0.7 | 5                  1.0 | 0.7 | 5
1.5 | 1                 2.0 | 1.5 | 1                      |     |
1.9 | 6                 2.0 | 1.9 | 6                      |     |
2.1 | 5                 2.0 | 2.1 | 5                  2.0 | 2.1 | 5
2.7 | 1                 3.0 | 2.7 | 1                  3.0 | 2.7 | 1

In this example every x is unique and the table is already sorted by x. I now want to GROUP BY round(x) and get the tuple that holds the median of y in each group.

I can already compute the median for the whole table with this ranking query:

SELECT a.x, a.y
FROM xy_table a, xy_table b
WHERE a.y >= b.y
GROUP BY a.x, a.y
HAVING count(*) = (SELECT round((count(*)+1)/2) FROM xy_table)

Output: 0.1, 4.0

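But I have not yet managed to write a query that computes the median for subgroups.

Note: I do not have a median() aggregate function available. Please also do not propose solutions that use special PARTITION, RANK, or QUANTILE statements (as found in similar but too vendor-specific SO questions). I need plain SQL (i.e., compatible with SQLite without a median() function).

Edit: I was actually looking for the Medoid and not the Median.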

2 answers:

Answer 0: (score: 3)

I would suggest doing the calculation in your programming language:

for each group:
  for each record_in_group:
    append y to array
  median of array
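
The loop above could be sketched in Python with the standard sqlite3 module. This is only an illustration under a few assumptions: the helper names make_sample_db and group_medians are made up for this example, the sample rows are the ones from the question, and "median" means the actual row in the middle of each group (the larger of the two middle rows for an even count), which is what the question's edit asks for. Rounding and sorting are left to SQLite so that the groups match round(x) in SQL.

```python
import sqlite3
from itertools import groupby

# Sample rows (x, y) taken from the question.
SAMPLE_ROWS = [(0.1, 4), (0.2, 3), (0.7, 5), (1.5, 1), (1.9, 6), (2.1, 5), (2.7, 1)]


def make_sample_db():
    """Hypothetical helper: in-memory database holding the sample xy_table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE xy_table (x REAL, y REAL)")
    conn.executemany("INSERT INTO xy_table VALUES (?, ?)", SAMPLE_ROWS)
    return conn


def group_medians(conn):
    """Return one (gid, x, y) row per group: the middle row when sorted by y."""
    # SQLite computes gid and sorts, so grouping agrees with round(x) in SQL.
    rows = conn.execute(
        "SELECT round(x) AS gid, x, y FROM xy_table ORDER BY gid, y, x"
    ).fetchall()
    result = []
    for _gid, members in groupby(rows, key=lambda row: row[0]):
        members = list(members)
        # len // 2 is the middle index, or the upper of the two middle indexes.
        result.append(members[len(members) // 2])
    return result


if __name__ == "__main__":
    for row in group_medians(make_sample_db()):
        print(row)
```

For the sample data this yields the rows (0.0, 0.1, 4.0), (1.0, 0.7, 5.0), (2.0, 2.1, 5.0) and (3.0, 2.7, 1.0), matching the "medians" column of the table in the question.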

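But if you are stuck with SQLite, you can order each group by y and select the record in the middle, as in http://sqlfiddle.com/#!5/d4c68/55/0

UPDATE: only the bigger "median" value matters for an even number of rows, so no avg() is needed: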

select groups.gid,
  ids.y median
from (
  -- get middle row number in each group (bigger number if even nr. of rows)
  -- note the integer division
  select round(x) gid,
    count(*) / 2 + 1 mid_row_right
  from xy_table
  group by round(x)
) groups
join (
  -- for each record get equivalent of
  -- row_number() over(partition by gid order by y)
  select round(a.x) gid,
    a.x,
    a.y,
    count(*) rownr_by_y
  from xy_table a
  left join xy_table b
    on round(a.x) = round (b.x)
    and a.y >= b.y
  group by a.x
) ids on ids.gid = groups.gid
where ids.rownr_by_y = groups.mid_row_right
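
One thing to watch with the query above: two rows of the same group that share a y value get the same rownr_by_y, so the middle row number can be skipped or matched twice. Below is a sketch of a variant that breaks ties on x, which the question says is unique. The tie-break, the alias names and the helper medoid_rows are my own additions for illustration, not part of the original query, and the whole thing is wrapped in Python's sqlite3 module only so it can be run directly.

```python
import sqlite3

# Sample rows (x, y) taken from the question.
SAMPLE = [(0.1, 4), (0.2, 3), (0.7, 5), (1.5, 1), (1.9, 6), (2.1, 5), (2.7, 1)]

# Same idea as the query above; the only change in logic is the tie-break
# on x inside the self-join condition (an assumption: x is unique).
MEDOID_SQL = """
SELECT grp.gid, ids.x, ids.y
FROM (
  -- middle row number per group (the bigger one for an even row count)
  SELECT round(x) AS gid, count(*) / 2 + 1 AS mid_row
  FROM xy_table
  GROUP BY round(x)
) AS grp
JOIN (
  -- emulate row_number() over (partition by gid order by y, x)
  SELECT round(a.x) AS gid, a.x AS x, a.y AS y, count(*) AS rownr
  FROM xy_table a
  JOIN xy_table b
    ON round(a.x) = round(b.x)
   AND (a.y > b.y OR (a.y = b.y AND a.x >= b.x))
  GROUP BY a.x, a.y
) AS ids ON ids.gid = grp.gid
WHERE ids.rownr = grp.mid_row
ORDER BY grp.gid
"""


def medoid_rows(rows):
    """Hypothetical helper: load rows into a fresh xy_table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE xy_table (x REAL, y REAL)")
    conn.executemany("INSERT INTO xy_table VALUES (?, ?)", rows)
    return conn.execute(MEDOID_SQL).fetchall()


if __name__ == "__main__":
    print(medoid_rows(SAMPLE))
    # A group with a tie on y still yields exactly one row.
    print(medoid_rows([(0.1, 2), (0.2, 2), (0.3, 7)]))
```

With the tie-break every row in a group gets a distinct row number, so exactly one row per group is returned even when y values repeat.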

Answer 1: (score: 0)

OK, this relies on a temporary table:

create temporary table tmp (x float, y float);

insert into tmp
  select * from xy_table order by round(x), y;

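But you would probably create this only for the range of data you are interested in. Another way would be to make sure xy_table itself has this sort order, instead of just being sorted on x. The reason for this is that SQLite lacks row-numbering functionality.

Then: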

select tmp4.x as gid, t.* from (
  select tmp1.x, 
         round((tmp2.y + coalesce(tmp3.y, tmp2.y)) / 2) as y -- <- for larger of the two, change to: (case when tmp2.y > coalesce(tmp3.y, 0) then tmp2.y else tmp3.y end)
  from (
    select round(x) as x, min(rowid) + (count(*) / 2) as id1, 
           (case when count(*) % 2 = 0 then min(rowid) + (count(*) / 2) - 1 
                 else 0 end) as id2
    from (  
      select *, rowid from tmp
    ) t
    group by round(x)
  ) tmp1
  join tmp tmp2 on tmp1.id1 = tmp2.rowid
  left join tmp tmp3 on tmp1.id2 = tmp3.rowid
) tmp4
join xy_table t on tmp4.x = round(t.x) and tmp4.y = t.y

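If you want to treat the median as the larger of the two middle values, which does not fit the definition, as @Aprillion already pointed out, then you simply take the larger of the two y values instead of their average, on the third line of the query.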
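
For the "larger of the two" case the query can probably be shortened, because tmp is filled in ascending y order within each group, so the row at min(rowid) + count(*) / 2 is already the larger middle row. The following is a sketch of that simplification, not the query above verbatim. It assumes that rowids in the freshly created tmp table follow the ORDER BY of the INSERT ... SELECT, and the helper name medoid_via_temp_table and the extra tie-break on x are made up for this example.

```python
import sqlite3

# Sample rows (x, y) taken from the question.
SAMPLE_XY = [(0.1, 4), (0.2, 3), (0.7, 5), (1.5, 1), (1.9, 6), (2.1, 5), (2.7, 1)]


def medoid_via_temp_table(rows):
    """Hypothetical helper: pick each group's middle row through rowid arithmetic."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE xy_table (x REAL, y REAL)")
    conn.executemany("INSERT INTO xy_table VALUES (?, ?)", rows)

    # Assumption: a fresh table gets rowids 1..n in the order of this SELECT.
    conn.execute("CREATE TEMPORARY TABLE tmp (x FLOAT, y FLOAT)")
    conn.execute(
        "INSERT INTO tmp SELECT x, y FROM xy_table ORDER BY round(x), y, x"
    )

    # min(rowid) is the first row of a group; adding count(*) / 2 (integer
    # division) lands on the middle row, or the larger of the two middle rows.
    return conn.execute(
        """
        SELECT round(m.x) AS gid, m.x, m.y
        FROM (
          SELECT min(rowid) + count(*) / 2 AS mid_id
          FROM tmp
          GROUP BY round(x)
        ) AS pick
        JOIN tmp AS m ON m.rowid = pick.mid_id
        ORDER BY gid
        """
    ).fetchall()


if __name__ == "__main__":
    print(medoid_via_temp_table(SAMPLE_XY))
```

Because the selected row comes straight from tmp, no join back to xy_table on an averaged y value is needed, which also avoids the case where the rounded average is not an actual y in the group.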