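I want to filter a table first and then compute the average and standard deviation of some columns of the filtered subset. After that, I want to filter the subset further based on the returned average and standard deviation. How would I attempt this in one query?
This is what my table looks like: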
| id | day                 | speed   | name | nor |
| 1  | 2017-02-03 00:00:00 | -12.3   | SYN  | 10  |
| 2  | 2018-02-03 00:00:00 | -6.36   | SYN  | 13  |
| 3  | 2015-02-03 00:00:00 | -26.36  | SYN  | 24  |
| 4  | 2017-02-03 00:00:00 | -156.36 | SYN  | 16  |
| 5  | 2017-02-03 00:00:00 | -36.36  | YRT  | 136 |
| 6  | 2017-02-03 00:00:00 | -16.36  | SYN  | 13  |
After the first filter (the inner query shown below), it looks like this:
| day                 | speed | nor |
| 2017-02-03 00:00:00 | 12.30 | 10  |
| 2018-02-03 00:00:00 | 6.36  | 13  |
| 2017-02-03 00:00:00 | 16.36 | 24  |
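Now, for this subset, I want to find the standard deviation and average of speed and nor, and then filter the subset again. So, for example, if the average speed of the 3 rows is 10 with a standard deviation of 1, and the average nor is 14 with a standard deviation of 3, then filtering for values less than avg(speed) + 3 * std deviation(speed) and avg(nor) + 3 * std deviation(nor) should give me rows 1 and 2.
This is what I tried, but it results in an Invalid use of group function error: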
SELECT t1.day, t1.speed, t1.nor FROM (
SELECT report.day AS day,
abs(report.speed) AS speed,
report.nor AS nor FROM report
WHERE
report.name = 'SYN'
AND
report.day > '2016-01-01 00:00:00'
AND
report.speed
BETWEEN
-40 AND -0.0001
) AS t1
WHERE
t1.speed < AVG(t1.speed) + 3 * STD(t1.speed)
AND
t1.nor < AVG(t1.speed) + 3 * STD(t1.speed)
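Answer 0 (score: 1)
A few things are still not entirely clear, but to give you an idea of how this can be done, I will outline how to approach it using the data you provided.
First, I earlier suggested using HAVING, but that will not work: HAVING operates on grouped results, whereas here you want to filter the rows of the original table based on aggregates computed over the filtered table.
Second, note that the average and standard deviation of the filtered rows are not the ones you mention in the text, so the queries below will not give the result you suggest.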
mysql> SELECT AVG(speed), STD(speed)
    -> FROM report
    -> WHERE report.name = 'SYN'
    ->   AND report.day > '2016-01-01 00:00:00'
    ->   AND report.speed BETWEEN -40 AND -0.0001;
+---------------------+-------------------+
| AVG(speed)          | STD(speed)        |
+---------------------+-------------------+
| -11.673333644866943 | 4.106461218126736 |
+---------------------+-------------------+
1 row in set (0.00 sec)
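In other words, they are not the 10 and 1 mentioned in the text above.
For your particular case, you want to take a filtered query and then run a different set of operations on it. This is a perfect case for a common table expression (CTE), which is available in MySQL 8.0. To use it, you define the t1 table once and reference it several times in the SELECT body of the CTE. I only have 5.7 installed here, so this is how it looks in PostgreSQL: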
WITH
t1 AS (
SELECT id, day, ABS(report.speed) AS speed, nor
FROM report
WHERE report.name = 'SYN'
AND report.day > '2016-01-01 00:00:00'
AND report.speed BETWEEN -40 AND -0.0001)
SELECT id, day, speed, nor
FROM t1
WHERE speed < (SELECT AVG(speed) + 3 * STDDEV_POP(speed) FROM t1)
AND nor < (SELECT AVG(nor) + 3 * STDDEV_POP(nor) FROM t1);
This results in:
id | day | speed | nor
----+---------------------+-------+-----
1 | 2017-02-03 00:00:00 | 12.3 | 10
2 | 2018-02-03 00:00:00 | 6.36 | 13
6 | 2017-02-03 00:00:00 | 16.36 | 13
(3 rows)
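All three filtered rows come back because the upper bound lies well above every value. As a quick check of the arithmetic, the Python sketch below recomputes the bounds from the three filtered rows with the standard library's statistics module (pstdev is the population standard deviation, which is what STDDEV_POP computes). The helper name upper_bound is made up for this illustration.

```python
from statistics import fmean, pstdev

# ABS(speed) and nor for the rows that survive the first filter (ids 1, 2, 6).
speeds = [12.3, 6.36, 16.36]
nors = [10, 13, 13]


def upper_bound(values):
    """Mean plus three population standard deviations: AVG(x) + 3 * STDDEV_POP(x)."""
    return fmean(values) + 3 * pstdev(values)


speed_limit = upper_bound(speeds)  # roughly 23.99
nor_limit = upper_bound(nors)      # roughly 16.24

print(f"speed must be below {speed_limit:.2f}, nor below {nor_limit:.2f}")
```

Every speed (at most 16.36) and every nor (at most 13) is below both bounds, so no row is dropped by the second filter on this sample.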
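If you are on MySQL 5.7, which has no CTEs, you have to take the SELECT body of the CTE above and repeat the "filter query" for every occurrence of t1, which gives you the following: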
SELECT t1.day, t1.speed, t1.nor
FROM (SELECT id, day, ABS(report.speed) AS speed, nor
FROM report
WHERE report.name = 'SYN'
AND report.day > '2016-01-01 00:00:00'
AND report.speed BETWEEN -40 AND -0.0001) AS t1
WHERE t1.speed < (SELECT AVG(ABS(speed)) + 3 * STD(ABS(speed))
FROM report
WHERE report.name = 'SYN'
AND report.day > '2016-01-01 00:00:00'
AND report.speed BETWEEN -40 AND -0.0001)
AND t1.nor < (SELECT AVG(nor) + 3 * STD(nor)
FROM report
WHERE report.name = 'SYN'
AND report.day > '2016-01-01 00:00:00'
AND report.speed BETWEEN -40 AND -0.0001);
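You might be tempted to create a derived table for the subqueries in the WHERE clause, but that does not work, because the scope containing the derived table does not extend into the WHERE clause.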
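Update. I have to partly correct myself, so here is the full story about derived tables.
Trying to create a derived table t2 that computes the averages and standard deviations, and then using it in subqueries in the WHERE clause, fails with an error:
SELECT t1.day, t1.speed, t1.nor
FROM (SELECT id, day, ABS(report.speed) AS speed, nor
      FROM report
      WHERE report.name = 'SYN'
        AND report.day > '2016-01-01 00:00:00'
        AND report.speed BETWEEN -40 AND -0.0001) AS t1,
     (SELECT AVG(ABS(speed)) AS avg_speed, STD(ABS(speed)) AS std_speed,
             AVG(nor) AS avg_nor, STD(nor) AS std_nor
      FROM report
      WHERE report.name = 'SYN'
        AND report.day > '2016-01-01 00:00:00'
        AND report.speed BETWEEN -40 AND -0.0001) AS t2
WHERE t1.speed < (SELECT avg_speed + 3 * std_speed FROM t2)
  AND t1.nor < (SELECT avg_nor + 3 * std_nor FROM t2);
ERROR 1146 (42S02): Table 'test.t2' doesn't exist
This is because the scope of t2 does not extend into the subqueries in the WHERE clause. It does work, however, if you do not use subqueries and instead reference the columns of t2 directly in the outer WHERE clause.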
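As a sketch of that subquery-free form, the query below joins t1 with the one-row derived table t2 and compares against t2's columns directly. It is wrapped in Python's sqlite3 module only so that it can be run without a MySQL server. That wrapper, the custom STD aggregate (SQLite has no built-in one), and the aliases avg_speed, std_speed, avg_nor and std_nor are assumptions made for this illustration. Note that it compares nor against its own average and deviation, as the question's text describes.

```python
import math
import sqlite3


class PopulationStd:
    """Population standard deviation, standing in for MySQL's STD() aggregate."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        if not self.values:
            return None
        mean = sum(self.values) / len(self.values)
        return math.sqrt(sum((v - mean) ** 2 for v in self.values) / len(self.values))


# Sample rows copied from the question's table.
ROWS = [
    (1, "2017-02-03 00:00:00", -12.3, "SYN", 10),
    (2, "2018-02-03 00:00:00", -6.36, "SYN", 13),
    (3, "2015-02-03 00:00:00", -26.36, "SYN", 24),
    (4, "2017-02-03 00:00:00", -156.36, "SYN", 16),
    (5, "2017-02-03 00:00:00", -36.36, "YRT", 136),
    (6, "2017-02-03 00:00:00", -16.36, "SYN", 13),
]

# The "filter query" shared by both derived tables.
FILTER = """
      FROM report
      WHERE report.name = 'SYN'
        AND report.day > '2016-01-01 00:00:00'
        AND report.speed BETWEEN -40 AND -0.0001
"""

# t2 has exactly one row, so the comma join attaches its columns to every t1 row.
QUERY = f"""
SELECT t1.id, t1.speed, t1.nor
FROM (SELECT id, day, ABS(report.speed) AS speed, nor {FILTER}) AS t1,
     (SELECT AVG(ABS(speed)) AS avg_speed, STD(ABS(speed)) AS std_speed,
             AVG(nor) AS avg_nor, STD(nor) AS std_nor {FILTER}) AS t2
WHERE t1.speed < t2.avg_speed + 3 * t2.std_speed
  AND t1.nor < t2.avg_nor + 3 * t2.std_nor
ORDER BY t1.id
"""


def run_query():
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("STD", 1, PopulationStd)
    conn.execute(
        "CREATE TABLE report (id INTEGER, day TEXT, speed REAL, name TEXT, nor INTEGER)"
    )
    conn.executemany("INSERT INTO report VALUES (?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


for row in run_query():
    print(row)
```

On the sample data this returns the rows with ids 1, 2 and 6, the same rows as the CTE version. Since STD is built into MySQL, the SQL text itself should need no custom aggregate there.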