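Aggregate data in MySQL by quarter and by customer_type (using the median)

Date: 2014-08-12 14:52:58

Tags: mysql sql

I am using MySQL, and my SQL table looks like this:

sales_year (INT), sales_month (INT), sales_day (INT), price (float), customer_type (TEXT)

I would like to know which SQL query would aggregate the price data by quarter (computing the median price for each quarter, along with the number of observations used to compute the median), grouped by customer type.

I am struggling with two main steps: MySQL does not seem to support a median, and I am not sure how to aggregate the data by quarter. Grouping by customer type seems easy once these two problems are solved.

STRUGGLE: computing the median

For example, I tried creating a quarter column and it works, but it computes the AVG rather than the median: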

    -- months 1-3 -> quarter 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
    select avg(price) as avg_price, floor((sales_month - 1)/3) + 1 as
    sales_quarter, count(*) as n_transactions, sales_year, customer_type
    from mydb.mytable
    group by sales_quarter, sales_year, customer_type;

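This query runs fine. Ideally I would simply replace avg with MEDIAN, but MySQL has no such function. Any suggestions on how to change this code so that it computes the median instead?

Note: I also tried installing my own median function as a user-defined function from this site, but the C code did not compile on my Mac OS X.

So the output would look like this: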

sales_quarter (INT)
sales_year (INT)
median_price (FLOAT)
number_users_used_to_compute_median (INT)
customer_type (TEXT)
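
To pin down the target result independently of MySQL, here is a minimal Python sketch of the aggregation described above. The sample rows and the helper name `quarterly_medians` are made up for illustration. The quarter is derived as (month - 1) // 3 + 1, so that months 1 to 3 fall in quarter 1 and months 10 to 12 in quarter 4.

```python
from collections import defaultdict
from statistics import median

def quarterly_medians(rows):
    """rows: iterable of (sales_year, sales_month, price, customer_type)."""
    groups = defaultdict(list)
    for year, month, price, ctype in rows:
        quarter = (month - 1) // 3 + 1  # months 1-3 -> 1, ..., 10-12 -> 4
        groups[(year, quarter, ctype)].append(price)
    # For each group: (median price, number of observations used)
    return {key: (median(prices), len(prices)) for key, prices in sorted(groups.items())}

# Hypothetical sample data, not taken from the question
sample = [
    (2014, 1, 10.0, "retail"),
    (2014, 2, 30.0, "retail"),
    (2014, 3, 20.0, "retail"),
    (2014, 4, 5.0, "retail"),
    (2014, 6, 7.0, "retail"),
    (2014, 3, 8.0, "wholesale"),
]
result = quarterly_medians(sample)
```

Any SQL solution can be compared against a reference computation like this on a small extract of the table.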

4 Answers:

Answer 0: (score: 1)

To get the median, you could try something like this:

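-- LIMIT only accepts integer constants, so compute the offset first
-- and pass it through a prepared statement (mytable/price are placeholders).
SET @offset = (SELECT COUNT(*) DIV 2 FROM mytable);
PREPARE stmt FROM 'SELECT price FROM mytable ORDER BY price LIMIT ?, 1';
EXECUTE stmt USING @offset;
DEALLOCATE PREPARE stmt;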

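This basically says: "Give me 1 item starting from the n/2-th element, where n is the size of the set."

So, if you do this by quarter, there would also be some GROUP BY quarter logic involved. Let me know if you want me to expand on this.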
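
As a quick check of the "n/2-th element" idea, the Python sketch below mimics sorting followed by an offset of n/2 on a plain list. The function name `limit_median` is hypothetical. Note that for an even number of values this picks the upper of the two middle values rather than their average.

```python
def limit_median(values):
    """Mimic taking one row at 0-based offset n DIV 2 from the sorted values."""
    ordered = sorted(values)
    offset = len(ordered) // 2
    return ordered[offset]

odd_case = limit_median([7, 1, 5])       # sorted: 1, 5, 7 -> offset 1
even_case = limit_median([4, 1, 3, 2])   # sorted: 1, 2, 3, 4 -> offset 2
```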

Answer 1: (score: 1)

Referring to velcrow's answer, I posted a SqlFiddle here

select quarter,
       group_concat(val order by row_number) ValSortString,
       floor((max(row_number) - min(row_number))/2)+1 as FirstPosition,
       ceil((max(row_number) - min(row_number))/2) +1 as SecondPosition,
       split_str(group_concat(val order by row_number),',',floor((max(row_number) - min(row_number))/2)+1) as FirstVal,
       split_str(group_concat(val order by row_number),',',ceil((max(row_number) - min(row_number))/2)+1) as SecondVal,
       (split_str(group_concat(val order by row_number),',',floor((max(row_number) - min(row_number))/2)+1) +
        split_str(group_concat(val order by row_number),',',ceil((max(row_number) - min(row_number))/2)+1) )/2 as Median
from (
      SELECT data.quarter,@rownum:=@rownum+1 as row_number, data.val,total_rows
             FROM data ,  
                  (select quarter,count(*) as total_rows from data group by quarter) as t,
                  (SELECT @rownum:=0) r
             where t.quarter = data.quarter
             order by data.quarter,val
     ) as b
group by quarter

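This code only groups by quarter, but it is easy to extend it to group by additional columns.

I use group_concat and split_str to keep it simple, with only one subquery.

So you have to create the split_str function: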

CREATE FUNCTION SPLIT_STR(
  x VARCHAR(255),
  delim VARCHAR(12),
  pos INT
)
RETURNS VARCHAR(255)
RETURN REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
       LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
       delim, '');

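The problem is that group_concat and split_str have length limits: group_concat truncates its result at group_concat_max_len (1024 bytes by default), and split_str as defined above accepts at most 255 characters. But this version needs only one subquery and is easy to understand.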
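
The position arithmetic used with the concatenated string can be checked on small inputs. The Python sketch below is only an illustration of that arithmetic, not of MySQL itself; the function name `median_from_concat` is made up, and `span` plays the role of max(row_number) - min(row_number).

```python
import math

def median_from_concat(concat, delim=","):
    """Apply the FirstPosition/SecondPosition arithmetic to a sorted, delimited string."""
    parts = concat.split(delim)
    span = len(parts) - 1                  # max(row_number) - min(row_number)
    first = math.floor(span / 2) + 1       # 1-based position of the lower middle value
    second = math.ceil(span / 2) + 1       # 1-based position of the upper middle value
    return (float(parts[first - 1]) + float(parts[second - 1])) / 2

odd_median = median_from_concat("1,5,7")       # both positions point at 5
even_median = median_from_concat("1,2,3,4")    # positions 2 and 3
```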

Update

Following Gordon Linoff's pointers, I added another solution without group_concat.

select quarter,
       floor((total_rows + 1)/2) as FirstPosition,
       ceil((total_rows + 1)/2) as SecondPosition,
       avg(val) as median
from (
      SELECT data.quarter,
             @rownum:= if (@q = data.quarter ,@rownum+1,if(@q := data.quarter, 1, 1) )as row_number, 
             data.val,
             total_rows
             FROM data ,  
                  (select quarter,count(*) as total_rows from data group by quarter) as t,
                  (SELECT @q := '', @rownum:=0) r
             where t.quarter = data.quarter
             order by data.quarter,val
     ) as b
where row_number in (floor((total_rows + 1)/2), ceil((total_rows + 1)/2))
group by quarter

Sql Fiddle here
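
The second query restarts the row counter whenever the quarter changes and then keeps rows floor((n + 1)/2) and ceil((n + 1)/2). A minimal Python sketch of that logic follows, assuming (quarter, val) pairs as input; the function name `medians_by_group` and the sample values are hypothetical.

```python
import math
from itertools import groupby

def medians_by_group(rows):
    """rows: (quarter, val) pairs. Rows are numbered 1..n within each quarter,
    in order of val, and only the middle row(s) are averaged."""
    out = {}
    ordered = sorted(rows)                            # order by quarter, val
    for quarter, grp in groupby(ordered, key=lambda r: r[0]):
        vals = [v for _, v in grp]
        n = len(vals)
        wanted = {math.floor((n + 1) / 2), math.ceil((n + 1) / 2)}
        picked = [v for rn, v in enumerate(vals, start=1) if rn in wanted]
        out[quarter] = sum(picked) / len(picked)      # avg over 1 or 2 rows
    return out

medians = medians_by_group([(1, 10), (1, 30), (1, 20), (2, 4), (2, 8)])
```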

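I am new to MySQL. This problem is easy to solve in MSSQL, DB2, or Oracle, which all have ROW_NUMBER() OVER (PARTITION BY ...).

I do not have enough reputation to comment on Gordon Linoff's answer, but I want to thank him: it taught me how to emulate the ROW_NUMBER() OVER (PARTITION BY ...) functionality.

Answer 2: (score: 1)

Oh, just call the average the median. The people you are talking to generally won't know the difference (;).

OK, seriously, you can do this in MySQL. There is a method using group_concat() and substring_index(), but it runs the risk of overflowing the intermediate string value. Instead, enumerate the values and do simple arithmetic. For this, you need an enumeration and a total. The enumeration is: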

  -- assumes mytable has (or derives) a sales_quarter column alongside sales_year
  select t.*,
         @rn := if(@q = sales_quarter and @y = sales_year and @ct = customer_type,
                   @rn + 1,
                   if(@q := sales_quarter, if(@y := sales_year, if(@ct := customer_type, 1, 1), 1), 1)
                  ) as rn
  from mydb.mytable t cross join
       (select @q := '', @y := '', @ct := '', @rn := 0) vars
  order by sales_quarter, sales_year, customer_type, price;

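This is carefully formulated. The order by columns correspond to the variables that are defined. Only one statement assigns variables in the select. The nested if() statements ensure that each variable gets set (using and or or could result in short-circuiting). It is important to remember that MySQL does not guarantee the order of evaluation of expressions in the select, so having only one statement set the variables is important for guaranteeing correctness.

Now, getting the median is pretty easy. You need the total count, the sequential value (rn), and some arithmetic to handle the case of an even number of values: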

-- assumes mytable has (or derives) a sales_quarter column alongside sales_year
select trn.sales_quarter, trn.sales_year, trn.customer_type, avg(price) as median
from (select t.*,
             @rn := if(@q = sales_quarter and @y = sales_year and @ct = customer_type,
                       @rn + 1,
                       if(@q := sales_quarter, if(@y := sales_year, if(@ct := customer_type, 1, 1), 1), 1)
                      ) as rn
      from mydb.mytable t cross join
           (select @q := '', @y := '', @ct := '', @rn := 0) vars
      order by sales_quarter, sales_year, customer_type, price
     ) trn join
     (select sales_quarter, sales_year, customer_type, count(*) as numrows
      from mydb.mytable t
      group by sales_quarter, sales_year, customer_type
     ) s
     on trn.sales_quarter = s.sales_quarter and
        trn.sales_year = s.sales_year and
        trn.customer_type = s.customer_type
-- odd numrows: keeps rn = (numrows + 1)/2; even numrows: keeps rn = numrows/2 and numrows/2 + 1
where 2*rn in (numrows, numrows + 1, numrows + 2)
group by trn.sales_quarter, trn.sales_year, trn.customer_type;

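Just to emphasize: the final avg() is not doing an averaging calculation in the usual sense. It is computing the median. The usual definition is that, for an even number of values, the median is the average of the two middle ones. The where clause handles both the even and the odd cases.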
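
A filter of the form 2*rn in (...) is easy to get off by one, so it is worth enumerating which row numbers it keeps for small group sizes. The Python sketch below (helper names are made up) suggests that the triple (numrows, numrows + 1, numrows + 2) keeps exactly the middle row for odd counts and the middle two rows for even counts, while a shifted triple such as (numrows - 1, numrows, numrows + 1) does not.

```python
def kept_rows(n, offsets):
    """Row numbers rn in 1..n for which 2*rn equals n + d for some d in offsets."""
    targets = {n + d for d in offsets}
    return [rn for rn in range(1, n + 1) if 2 * rn in targets]

def middle_rows(n):
    """Rows a median needs: the middle one (odd n) or the middle two (even n)."""
    return [(n + 1) // 2] if n % 2 else [n // 2, n // 2 + 1]

sizes = range(1, 50)
triple_ok = all(kept_rows(n, (0, 1, 2)) == middle_rows(n) for n in sizes)
shifted_ok = all(kept_rows(n, (-1, 0, 1)) == middle_rows(n) for n in sizes)
```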

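Answer 3: (score: 0)

I know of two ways to do this. The first uses two selects and a join: the first select gets the values and ranks, the second select gets the counts, and then they are joined. The second uses json functions to select everything in one pass. Both are a bit long, but they work and are reasonably fast.

Solution 1 (two selects and a join, one to get the counts and one to get the ranks)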

SELECT  x.group_field, 
        avg(
            if( 
                x.rank - y.vol/2 BETWEEN 0 AND 1, 
                value_field, 
                null
            )
        ) as median
FROM (
    SELECT  group_field, value_field, 
            @r:= IF(@current=group_field, @r+1, 1) as rank, 
            @current:=group_field
    FROM (
        SELECT group_field, value_field
        FROM table_name
        ORDER BY group_field, value_field
    ) z, (SELECT @r:=0, @current:='') v
) x, (
    SELECT group_field, count(*) as vol 
    FROM table_name
    GROUP BY group_field
) y WHERE x.group_field = y.group_field
GROUP BY x.group_field;
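
The condition rank - vol/2 BETWEEN 0 AND 1 is what selects the middle row or rows, provided vol/2 is evaluated as a fractional division. A minimal Python sketch of the same test, with a hypothetical function name and sample values:

```python
def median_by_rank(values):
    """Average the values whose 1-based rank r satisfies 0 <= r - n/2 <= 1."""
    ordered = sorted(values)
    n = len(ordered)
    picked = [v for r, v in enumerate(ordered, start=1) if 0 <= r - n / 2 <= 1]
    return sum(picked) / len(picked)

odd_result = median_by_rank([9, 1, 5])        # n = 3: only rank 2 qualifies
even_result = median_by_rank([4, 1, 3, 2])    # n = 4: ranks 2 and 3 qualify
```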

Solution 2 (using a json object to store the counts and avoid the join)

SELECT group_field, 
    avg(
        if(
            rank - json_extract(@vols, path)/2 BETWEEN 0 AND 1,
            value_field,
            null
        )
    ) as median
FROM (
    SELECT group_field, value_field, path, 
        @rnk := if(@curr = group_field, @rnk+1, 1) as rank,
        @vols := json_set(
            @vols, 
            path, 
            coalesce(json_extract(@vols, path), 0) + 1
        ) as vols,
        @curr := group_field
    FROM (
        SELECT p.group_field, p.value_field, concat('$.', p.group_field) as path
        FROM table_name p
        JOIN (SELECT @curr:='', @rnk:=1, @vols:=json_object()) v
        ORDER BY group_field, value_field DESC
    ) z
) y GROUP BY group_field;