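Applying aggregate functions on a subset of rows, then filtering the subset based on the aggregates

Time: 2020-04-30 18:32:36

Tags: mysql

I want to filter a table first and then compute the average and standard deviation of some columns of the filtered subset. After that, I want to filter the subset further based on the returned average and standard deviation. How would I do this in a single query?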

This is what my table looks like:

| id   | day                 | speed    | name | nor
 1      2017-02-03 00:00:00   -12.3      SYN    10
 2      2018-02-03 00:00:00   -6.36      SYN    13
 3      2015-02-03 00:00:00   -26.36     SYN    24
 4      2017-02-03 00:00:00   -156.36    SYN    16
 5      2017-02-03 00:00:00   -36.36     YRT    136
 6      2017-02-03 00:00:00   -16.36     SYN    13

After the first filter (the inner query shown below), it looks like this:

|day                    |speed       |nor
 2017-02-03 00:00:00     12.30        10
 2018-02-03 00:00:00      6.36        13
 2017-02-03 00:00:00     16.36        24

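Now, for this subset, I want to find the standard deviation and average of speed and nor, and then filter the subset again. For example, if the average speed of the 3 rows is 10 with a standard deviation of 1, and the average nor is 14 with a standard deviation of 3, then filtering for values less than avg(speed) + 3 * std deviation(speed) and avg(nor) + 3 * std deviation(nor) should give me rows 1 and 2.

This is what I tried, but it results in an Invalid use of group function error.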

SELECT t1.day, t1.speed, t1.nor
  FROM (SELECT report.day AS day,
               ABS(report.speed) AS speed,
               report.nor AS nor
          FROM report
         WHERE report.name = 'SYN'
           AND report.day > '2016-01-01 00:00:00'
           AND report.speed BETWEEN -40 AND -0.0001) AS t1
 WHERE t1.speed < AVG(t1.speed) + 3 * STD(t1.speed)
   AND t1.nor < AVG(t1.speed) + 3 * STD(t1.speed)

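1 answer:

Answer 0 (score: 1)

A few things are still not entirely clear, but to give you an idea of how to do this, I will outline how to approach it with the data you provided.

First, I mentioned using HAVING, but that does not work here: HAVING operates on the grouped result, whereas you want to filter the rows of the original table based on averages computed over the filtered table.

Second, note that the average and standard deviation of the filtered rows are not the values you mention in your text, so the queries below will not give the result you suggest.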

mysql> SELECT AVG(speed), STD(speed)
    ->   FROM report 
    ->  WHERE report.name = 'SYN' 
    ->    AND report.day > '2016-01-01 00:00:00' 
    ->    AND report.speed BETWEEN -40 AND -0.0001;
+---------------------+-------------------+
| AVG(speed)          | STD(speed)        |
+---------------------+-------------------+
| -11.673333644866943 | 4.106461218126736 |
+---------------------+-------------------+
1 row in set (0.00 sec)

In other words, they are not the 10 and 1 mentioned in the text above.
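The numbers can be checked by hand. The following is only a sketch in Python (not part of the original queries), assuming the three rows that pass the filter are ids 1, 2 and 6 from the sample table. `statistics.pstdev` computes the population standard deviation, which is the definition MySQL's STD() uses. The variable names are illustrative.

```python
import statistics

# Raw speed values of the rows that pass the filter (ids 1, 2 and 6).
speeds = [-12.3, -6.36, -16.36]

mean = statistics.mean(speeds)
std = statistics.pstdev(speeds)  # population standard deviation, like STD()
print(mean, std)

# The later queries work on ABS(speed), so the cutoff uses absolute values.
abs_speeds = [abs(s) for s in speeds]
cutoff = statistics.mean(abs_speeds) + 3 * statistics.pstdev(abs_speeds)
print(cutoff)
```

The mean and standard deviation come out at roughly -11.67 and 4.11, in line with the MySQL output above up to floating-point rounding, and the cutoff on the absolute values is just under 24.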

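In your case, you want to run a filtering query and then perform several different operations on its result. This is a perfect use case for a common table expression (CTE), which is available in MySQL 8.0. To use it, you define t1 in the WITH clause and then reference it multiple times in the SELECT body of the CTE. I only have 5.7 installed here, so here is what it looks like in PostgreSQL: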

WITH
  t1 AS (
    SELECT id, day, ABS(report.speed) AS speed, nor
      FROM report 
     WHERE report.name = 'SYN' 
       AND report.day > '2016-01-01 00:00:00' 
       AND report.speed BETWEEN -40 AND -0.0001)
SELECT id, day, speed, nor
  FROM t1
 WHERE speed < (SELECT AVG(speed) + 3 * STDDEV_POP(speed) FROM t1)
   AND nor < (SELECT AVG(speed) + 3 * STDDEV_POP(speed) FROM t1);

This results in:

 id |         day         | speed | nor 
----+---------------------+-------+-----
  1 | 2017-02-03 00:00:00 |  12.3 |  10
  2 | 2018-02-03 00:00:00 |  6.36 |  13
  6 | 2017-02-03 00:00:00 | 16.36 |  13
(3 rows)
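For readers without a PostgreSQL or MySQL 8.0 server at hand, the CTE approach can be tried out with Python's built-in sqlite3 module. This is only a sketch under several assumptions: SQLite stands in for the real database, the table is loaded with the sample rows from the question, and because SQLite has no STD or STDDEV_POP aggregate, a population standard deviation is registered by hand under the name stddev_pop. The StddevPop class is a helper written for this illustration. The query itself mirrors the one above, including its reuse of the speed cutoff for nor.

```python
import math
import sqlite3


class StddevPop:
    """Hand-written population standard deviation aggregate (illustrative)."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Ignore NULLs, as SQL aggregates do.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        if not self.values:
            return None
        mean = sum(self.values) / len(self.values)
        variance = sum((v - mean) ** 2 for v in self.values) / len(self.values)
        return math.sqrt(variance)


conn = sqlite3.connect(":memory:")
conn.create_aggregate("stddev_pop", 1, StddevPop)
conn.execute(
    "CREATE TABLE report (id INTEGER, day TEXT, speed REAL, name TEXT, nor INTEGER)"
)
conn.executemany(
    "INSERT INTO report VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2017-02-03 00:00:00", -12.3, "SYN", 10),
        (2, "2018-02-03 00:00:00", -6.36, "SYN", 13),
        (3, "2015-02-03 00:00:00", -26.36, "SYN", 24),
        (4, "2017-02-03 00:00:00", -156.36, "SYN", 16),
        (5, "2017-02-03 00:00:00", -36.36, "YRT", 136),
        (6, "2017-02-03 00:00:00", -16.36, "SYN", 13),
    ],
)

# Same shape as the CTE query above: t1 is defined once and used three times.
query = """
WITH t1 AS (
    SELECT id, day, ABS(report.speed) AS speed, nor
      FROM report
     WHERE report.name = 'SYN'
       AND report.day > '2016-01-01 00:00:00'
       AND report.speed BETWEEN -40 AND -0.0001)
SELECT id, day, speed, nor
  FROM t1
 WHERE speed < (SELECT AVG(speed) + 3 * stddev_pop(speed) FROM t1)
   AND nor < (SELECT AVG(speed) + 3 * stddev_pop(speed) FROM t1)
 ORDER BY id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On the sample data this keeps the rows with ids 1, 2 and 6, the same rows as in the PostgreSQL output above.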

If you are on MySQL 5.7, which has no CTEs, you have to take the SELECT body of the CTE above and repeat the "filtering query" for every occurrence of t1, which gives you:

SELECT t1.day, t1.speed, t1.nor
  FROM (SELECT id, day, ABS(report.speed) AS speed, nor
          FROM report
         WHERE report.name = 'SYN'
           AND report.day > '2016-01-01 00:00:00'
           AND report.speed BETWEEN -40 AND -0.0001) AS t1
 WHERE t1.speed < (SELECT AVG(ABS(speed)) + 3 * STD(ABS(speed))
                     FROM report
                    WHERE report.name = 'SYN'
                      AND report.day > '2016-01-01 00:00:00'
                      AND report.speed BETWEEN -40 AND -0.0001)
   AND t1.nor < (SELECT AVG(ABS(speed)) + 3 * STD(ABS(speed))
                   FROM report
                  WHERE report.name = 'SYN'
                    AND report.day > '2016-01-01 00:00:00'
                    AND report.speed BETWEEN -40 AND -0.0001);

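You might be tempted to create a derived table for the subqueries in the WHERE clause, but that does not work, because the scope of a derived table does not extend into the subqueries of the WHERE clause.

Update: I have to partially correct myself, so here is the full story about derived tables.

Trying to create a derived table t2 that computes the average and standard deviation, and then using it in subqueries in the WHERE clause, fails with an error:

SELECT t1.day, t1.speed, t1.nor
  FROM (SELECT id, day, ABS(report.speed) AS speed, nor
          FROM report
         WHERE report.name = 'SYN'
           AND report.day > '2016-01-01 00:00:00'
           AND report.speed BETWEEN -40 AND -0.0001) AS t1,
       (SELECT AVG(ABS(speed)) AS avg, STD(ABS(speed)) AS std
          FROM report
         WHERE report.name = 'SYN'
           AND report.day > '2016-01-01 00:00:00'
           AND report.speed BETWEEN -40 AND -0.0001) AS t2
 WHERE t1.speed < (SELECT avg + 3 * std FROM t2)
   AND t1.nor < (SELECT avg + 3 * std FROM t2);
ERROR 1146 (42S02): Table 'test.t2' doesn't exist

This is because the scope of t2 does not extend into the subqueries in the WHERE clause. The columns of t2 can, however, be referenced directly in the outer WHERE clause without a subquery, keeping the rest of the query as above:

 WHERE t1.speed < t2.avg + 3 * t2.std
   AND t1.nor < t2.avg + 3 * t2.std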