I am using MySQL, and my SQL table looks like this:
sales_year (INT), sales_month (INT), sales_day (INT), price (float), customer_type (TEXT)
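I would like to know which SQL query would aggregate the price data by quarter (computing the median price for each quarter, plus the number of observations used to compute the median), grouped by customer type.
I am struggling with two main steps: MySQL does not seem to support a median function, and I am not sure how to aggregate the data by quarter. Grouping by customer type seems easy once those two problems are solved.
STRUGGLE - computing the median...
For example, I tried creating a quarter column and it works, but it computes AVG rather than the median: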
select avg(price) as avg_price,
       floor((sales_month - 1) / 3) + 1 as sales_quarter,  -- months 1-3 -> 1, ..., months 10-12 -> 4
       count(*) as n_transactions, sales_year, customer_type
from mydb.mytable
group by sales_quarter, sales_year, customer_type;
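This command works fine. Ideally I would just swap AVG for MEDIAN, but MySQL has no such function. Any suggestions on how to change this code so that it computes the median?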
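A quarter expression is easy to get off by one, so it can help to evaluate it for all twelve months before grouping on it. The sketch below compares two candidate expressions on an inline list of months; the derived table `m` and the column aliases are illustrative only and are not part of the original table.

```sql
-- Compare two quarter expressions for months 1 through 12.
select m.sales_month,
       floor(m.sales_month / 3.0 + 1)     as shifted_quarter,   -- March -> 2, December -> 5
       floor((m.sales_month - 1) / 3) + 1 as calendar_quarter   -- always in 1..4
from (select 1 as sales_month union all select 2 union all select 3
      union all select 4 union all select 5 union all select 6
      union all select 7 union all select 8 union all select 9
      union all select 10 union all select 11 union all select 12) m
order by m.sales_month;
```

Subtracting 1 before dividing is what keeps March, June, September, and December in their own calendar quarter.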
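Note: I also tried installing my own median function as a user-defined function from this site, but the C code did not compile on my Mac OS X.
So the output would look like this: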
sales_quarter (INT)
sales_year (INT)
median_price (FLOAT)
number_users_used_to_compute_median (INT)
customer_type (TEXT)
Answer 0 (score: 1)
To get the median, you could try something like this:
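-- LIMIT does not accept expressions, so compute the offset first and pass it to a prepared statement
SET @offset = (SELECT COUNT(*) DIV 2 FROM mydb.mytable);
PREPARE stmt FROM 'SELECT price FROM mydb.mytable ORDER BY price LIMIT ?, 1';
EXECUTE stmt USING @offset;
DEALLOCATE PREPARE stmt;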
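This basically says: "give me 1 item starting from the (n/2)th element of the sorted values, where n is the size of the set." For an even n this returns the upper of the two middle values rather than their average.
So, if you do this by quarter, there will be some GROUP BY quarter kind of thing as well. Let me know if you want me to expand on this.
Answer 1 (score: 1)
This is based on velcrow's answer; a SqlFiddle is posted here.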
select quarter,
group_concat(val order by row_number) ValSortString,
floor((max(row_number) - min(row_number))/2)+1 as FirstPosition,
ceil((max(row_number) - min(row_number))/2) +1 as SecondPosition,
split_str(group_concat(val order by row_number),',',floor((max(row_number) - min(row_number))/2)+1) as FirstVal,
split_str(group_concat(val order by row_number),',',ceil((max(row_number) - min(row_number))/2)+1) as SecondVal,
(split_str(group_concat(val order by row_number),',',floor((max(row_number) - min(row_number))/2)+1) +
split_str(group_concat(val order by row_number),',',ceil((max(row_number) - min(row_number))/2)+1) )/2 as Median
from (
SELECT data.quarter,@rownum:=@rownum+1 as row_number, data.val,total_rows
FROM data ,
(select quarter,count(*) as total_rows from data group by quarter) as t,
(SELECT @rownum:=0) r
where t.quarter = data.quarter
order by data.quarter,val
) as b
group by quarter
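This code groups by quarter only, but it is easy to extend to additional grouping columns.
I used group_concat and split_str to keep it simple, with just one subquery.
So you have to create the split_str function: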
CREATE FUNCTION SPLIT_STR(
x VARCHAR(255),
delim VARCHAR(12),
pos INT
)
RETURNS VARCHAR(255)
RETURN REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
delim, '');
The problem is that group_concat and split_str have length limits. But this version solves the problem with a single subquery and is easy to understand.
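To make the limit concrete: GROUP_CONCAT() truncates its result at the `group_concat_max_len` system variable, which defaults to 1024 bytes, so a quarter with many prices can be cut off partway through the string. A minimal sketch of inspecting and raising it for the current session follows; the value 1000000 is an arbitrary illustration, not a recommendation.

```sql
-- Inspect the current limit (the default is 1024 bytes).
select @@session.group_concat_max_len;

-- Raise it for this session only; 1000000 is an arbitrary illustrative value.
set session group_concat_max_len = 1000000;
```

The SPLIT_STR function as defined above also declares its input and return value as VARCHAR(255), so those declarations would presumably need widening as well before longer strings pass through intact.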
UPDATE
Following Gordon Linoff's pointer, I added another solution that does not use group_concat.
select quarter,
floor((total_rows + 1)/2) as FirstPosition,
ceil((total_rows + 1)/2) as SecondPosition,
avg(val) as median
from (
SELECT data.quarter,
@rownum:= if (@q = data.quarter ,@rownum+1,if(@q := data.quarter, 1, 1) )as row_number,
data.val,
total_rows
FROM data ,
(select quarter,count(*) as total_rows from data group by quarter) as t,
(SELECT @q := '', @rownum:=0) r
where t.quarter = data.quarter
order by data.quarter,val
) as b
where row_number in (floor((total_rows + 1)/2), ceil((total_rows + 1)/2))
group by quarter
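I am new to MySQL. This problem is easily solved in MSSQL, DB2, or Oracle, which all have ROW_NUMBER() OVER (PARTITION BY ...).
I don't have enough reputation to comment on Gordon Linoff's answer, so I want to thank him here: it is where I learned how to emulate ROW_NUMBER() OVER (PARTITION BY ...).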
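For readers on MySQL 8.0 or later, window functions are available natively, so the user-variable emulation is no longer needed. The following is a sketch under that assumption, using the table and column names from the question; the aliases `q`, `ranked`, `rn`, `numrows`, `median_price`, and `n_transactions` are illustrative. It numbers the prices within each group, keeps only the middle row (odd count) or the two middle rows (even count), and averages them.

```sql
-- Assumes MySQL 8.0+ (window functions) and the question's table mydb.mytable.
select sales_quarter, sales_year, customer_type,
       avg(price)   as median_price,
       max(numrows) as n_transactions   -- rows in the group, constant within it
from (select q.*,
             row_number() over (partition by sales_quarter, sales_year, customer_type
                                order by price) as rn,
             count(*)     over (partition by sales_quarter, sales_year, customer_type) as numrows
      from (select sales_year, customer_type, price,
                   floor((sales_month - 1) / 3) + 1 as sales_quarter
            from mydb.mytable
           ) q
     ) ranked
-- odd numrows: keeps rn = (numrows + 1) / 2; even numrows: keeps rn = numrows / 2 and numrows / 2 + 1
where 2 * rn in (numrows, numrows + 1, numrows + 2)
group by sales_quarter, sales_year, customer_type;
```

Computing `sales_quarter` once in the innermost derived table keeps the two window definitions and the final GROUP BY in agreement.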
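Answer 2 (score: 1)
Oh, just call the average the median. The people you are talking to usually won't know the difference (;).
OK, seriously, you can do this in MySQL. There is a method that uses group_concat() and substring_index(), but it risks overflowing the intermediate string value. Instead, enumerate the values and do simple arithmetic. For this, you need an enumeration and a total count. The enumeration is: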
select t.*,
       @rn := if(@q = sales_quarter and @y = sales_year and @ct = customer_type,
                 @rn + 1,
                 if(@q := sales_quarter, if(@y := sales_year, if(@ct := customer_type, 1, 1), 1), 1)
                ) as rn
from (select m.*, floor((sales_month - 1) / 3) + 1 as sales_quarter  -- the table has no quarter column
      from mydb.mytable m
     ) t cross join
     (select @q := '', @y := '', @ct := '', @rn := 0) vars
order by sales_quarter, sales_year, customer_type, price;
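This is carefully constructed. The order by columns correspond to the variables that are defined. Only one expression in the select assigns variables. The nested if() calls ensure that every variable gets set (using and or or could short-circuit and skip an assignment). It is important to remember that MySQL does not guarantee the order of evaluation of expressions in the select, so having a single expression set all the variables is important for correctness.
Now, getting the median is pretty easy. You need the total count, the sequential value (rn), and some arithmetic to handle the case of an even number of values: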
select trn.sales_quarter, trn.sales_year, trn.customer_type, avg(price) as median
from (select t.*,
             @rn := if(@q = sales_quarter and @y = sales_year and @ct = customer_type,
                       @rn + 1,
                       if(@q := sales_quarter, if(@y := sales_year, if(@ct := customer_type, 1, 1), 1), 1)
                      ) as rn
      from (select m.*, floor((sales_month - 1) / 3) + 1 as sales_quarter
            from mydb.mytable m
           ) t cross join
           (select @q := '', @y := '', @ct := '', @rn := 0) vars
      order by sales_quarter, sales_year, customer_type, price
     ) trn join
     (select floor((sales_month - 1) / 3) + 1 as sales_quarter, sales_year, customer_type, count(*) as numrows
      from mydb.mytable
      group by sales_quarter, sales_year, customer_type
     ) s
     on trn.sales_quarter = s.sales_quarter and
        trn.sales_year = s.sales_year and
        trn.customer_type = s.customer_type
-- odd numrows: keeps rn = (numrows + 1) / 2; even numrows: keeps rn = numrows / 2 and numrows / 2 + 1
where 2*rn in (numrows, numrows + 1, numrows + 2)
group by trn.sales_quarter, trn.sales_year, trn.customer_type;
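Just to emphasize: the final avg() is not computing an average of the whole group. It is computing the median. The usual definition is that, for an even number of values, the median is the average of the two middle ones. The where clause handles both the even and the odd case, leaving either one or two rows per group for avg() to work on.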
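A middle-row filter of this kind should keep row (n + 1) / 2 when the group size n is odd, and rows n / 2 and n / 2 + 1 when it is even. The sketch below checks the condition `2 * rn in (numrows, numrows + 1, numrows + 2)` against group sizes 1 through 5 using inline derived tables; the aliases `n`, `r`, and `selected_rn` are illustrative only.

```sql
-- For each group size, list which row numbers the filter keeps.
select n.numrows, group_concat(r.rn order by r.rn) as selected_rn
from (select 1 as numrows union all select 2 union all select 3
      union all select 4 union all select 5) n
join (select 1 as rn union all select 2 union all select 3
      union all select 4 union all select 5) r
  on r.rn <= n.numrows
where 2 * r.rn in (n.numrows, n.numrows + 1, n.numrows + 2)
group by n.numrows;
-- numrows 1 -> 1 | 2 -> 1,2 | 3 -> 2 | 4 -> 2,3 | 5 -> 3
```

Running the same check with a different list inside `in (...)` is a quick way to catch an off-by-one in the offsets, for example a list that keeps only one row for an even group size.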
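Answer 3 (score: 0)
I know of two ways to do this. The first uses two selects and a join: one select gets the values and their ranks, the other gets the counts, and then they are joined. The second uses JSON functions to get everything in a single select. Both are a bit long, but they work and are reasonably fast.
Solution 1 (two selects and a join: one gets the counts, one gets the ranks)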
SELECT x.group_field,
avg(
if(
x.rank - y.vol/2 BETWEEN 0 AND 1,
value_field,
null
)
) as median
FROM (
SELECT group_field, value_field,
@r:= IF(@current=group_field, @r+1, 1) as rank,
@current:=group_field
FROM (
SELECT group_field, value_field
FROM table_name
ORDER BY group_field, value_field
) z, (SELECT @r:=0, @current:='') v
) x, (
SELECT group_field, count(*) as vol
FROM table_name
GROUP BY group_field
) y WHERE x.group_field = y.group_field
GROUP BY x.group_field;
Solution 2 (uses a JSON object to store the counts and avoids the join)
SELECT group_field,
avg(
if(
rank - json_extract(@vols, path)/2 BETWEEN 0 AND 1,
value_field,
null
)
) as median
FROM (
SELECT group_field, value_field, path,
@rnk := if(@curr = group_field, @rnk+1, 1) as rank,
@vols := json_set(
@vols,
path,
coalesce(json_extract(@vols, path), 0) + 1
) as vols,
@curr := group_field
FROM (
SELECT p.group_field, p.value_field, concat('$.', p.group_field) as path
FROM table_name p
JOIN (SELECT @curr:='', @rnk:=1, @vols:=json_object()) v
ORDER BY group_field, value_field DESC
) z
) y GROUP BY group_field;