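MySQL: find the median of a column grouped by another column, rather than the median of the whole column

Time: 2016-08-26 03:45:18

Tags: mysql median economics

Background:

I'm trying to take a series of market transactions and determine the amount of money actually moved for each item type. This is pretty much my first attempt at MySQL, so the query is ugly, but the following almost works: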

SELECT types.typename,
       averages.type,
       averages.price,
       movement.sold,
       ( averages.price * movement.sold ) AS value
FROM   (SELECT type,
               Round(Avg(price)) AS price
        FROM   orders
        GROUP  BY type) AS averages
       INNER JOIN (SELECT type,
                          ( startingvolume - currentvolume ) AS sold
                   FROM   (SELECT type,
                                  Sum(volume)        AS currentVolume,
                                  Sum(volumeentered) startingVolume
                           FROM   orders
                           GROUP  BY type) AS movement
                   WHERE  ( startingvolume - currentvolume ) > 10000
                   ORDER  BY sold) AS movement
               ON averages.type = movement.type
       INNER JOIN invtypes AS types
               ON types.typeid = averages.type
ORDER  BY value DESC
LIMIT  10 ;


+------------------------------------+-------+---------+------------+------------------+
| typeName                           | type  | price   | sold       | value            |
+------------------------------------+-------+---------+------------+------------------+
| Dirt                               |    34 | 1904767 | 2670581874 | 5086836224393358 |
| Light Wood                         |  2629 |   42999 |    2756595 |     118530828405 |
| Dark Wood                          | 24509 |   47344 |    1107771 |      52446310224 |
| Stone                              | 21922 |   18386 |    1505884 |      27687183224 |
| Grass                              |   238 |    5643 |    4554470 |      25700874210 |
| Paper                              |  3814 |   25635 |     861006 |      22071888810 |
| Iron                               |  3699 |  320270 |      58833 |      18842444910 |
| Ink                                | 16275 |    8552 |    2200545 |      18819060840 |
| Loam                               |  2679 |    5759 |    2608771 |      15023912189 |
| Copper                             |   672 |  904612 |      14989 |      13559229268 |
+------------------------------------+-------+---------+------------+------------------+

The problem with the output above is that the raw market data is inevitably corrupted by outliers, as shown here:

select type, price from orders where type = 34 order by price desc limit 10;


+------+-----------+
| type | price     |
+------+-----------+
|   34 | 200000000 |
|   34 |     15.99 |
|   34 |     12.06 |
|   34 |        10 |
|   34 |      7.67 |
|   34 |       7.5 |
|   34 |       7.3 |
|   34 |      7.17 |
|   34 |       7.1 |
|   34 |      7.06 |
+------+-----------+

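The core question:

99% of the market data is clean, but the outliers wreck the averages, and MySQL doesn't seem to have a median function. I've found several examples of how to find the median of an entire column, but I need the median for each item.

How can I determine the median for each item, rather than the average for each item, before the main query runs? Or, failing that, how can I effectively clean these outliers out of the data?

Note: I tried omitting results by std, but item prices range from $17 to $1 billion, and the deviations remain relatively low regardless of the price range.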

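1 answer:

Answer 0: (score: 0)

I'm not going to touch your original query, since it is quite complex, but one option is to use a subquery to remove any statistical outliers. For example, to drop from the orders table any rows whose price lies more than two standard deviations from the mean for their type, you could use: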

SELECT t1.type,
       t1.price
FROM orders t1
INNER JOIN
(
    -- per-type mean and (population) standard deviation of price
    SELECT type,
           AVG(price) AS avg_price,
           STD(price) AS std_price
    FROM orders
    GROUP BY type
) t2
    ON t1.type = t2.type
-- keep only rows within 2 standard deviations of their type's mean;
-- anything further away is discarded
WHERE ABS(t1.price - t2.avg_price) <= 2 * t2.std_price;

Demo here:

SQLFiddle
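
Since the question asks for a median per type rather than a filtered mean, here is a minimal sketch of one way to compute it. It assumes MySQL 8.0 or later, where window functions are available (they were not in the MySQL releases current when this was asked). It reuses the orders table and its type and price columns from the question. The names ranked, rn, cnt and median_price are my own, for illustration only.

```sql
-- Sketch only: assumes MySQL 8.0+ (window functions).
-- orders(type, price) comes from the question; ranked, rn, cnt and
-- median_price are illustrative names.
SELECT type,
       AVG(price) AS median_price   -- averages the one or two middle rows
FROM (
    SELECT type,
           price,
           -- position of each row within its type, cheapest first
           ROW_NUMBER() OVER (PARTITION BY type ORDER BY price) AS rn,
           -- number of rows for that type
           COUNT(*)     OVER (PARTITION BY type)                AS cnt
    FROM orders
) AS ranked
-- odd cnt: both expressions pick the single middle row;
-- even cnt: they pick the two rows either side of the middle
WHERE rn IN (FLOOR((cnt + 1) / 2), CEIL((cnt + 1) / 2))
GROUP BY type;
```

The result has one row per type, so it could stand in for the averages derived table in the original query, with the value wrapped in Round() and aliased as price. A single extreme order such as the 200000000 row for type 34 then only shifts the middle position by at most one row instead of dragging the whole figure upward.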