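Standard deviation analysis can be a useful way to find outliers. Is there a way to incorporate the result of this query (which finds the value four standard deviations above the mean)...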
SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) as high FROM [publicdata:samples.natality];
result = 12.721342001626912
...into another query that reports which states and months have the most babies born more than 4 standard deviations above the average weight?
SELECT state, year, month, COUNT(*) AS outlier_count
FROM [publicdata:samples.natality]
WHERE
(weight_pounds > 12.721342001626912)
AND
(state != '' AND state IS NOT NULL)
GROUP BY state, year, month
ORDER BY outlier_count DESC;
Result:
Row state year month outlier_count
1 MD 1990 12 22
2 NY 1989 10 17
3 CA 1991 9 14
Basically, it would be great to combine this into a single query.
Answer 0 (score: 4)
You can abuse JOIN for this (so performance will suffer):
SELECT n.state, n.year, n.month, COUNT(*) AS outlier_count
FROM (
SELECT state, year, month, weight_pounds, 1 as key
FROM [publicdata:samples.natality]) as n
JOIN (
SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) as giant_baby,
1 as key
FROM [publicdata:samples.natality]) as o
ON n.key = o.key
WHERE
(n.weight_pounds > o.giant_baby)
AND
(n.state != '' AND n.state IS NOT NULL)
GROUP BY n.state, n.year, n.month
ORDER BY outlier_count DESC;
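
The constant-key JOIN above can be tried locally on a small table before running it against the full dataset. The following is a minimal sketch in Python with sqlite3, not BigQuery. The table contents, the `find_outliers` helper, the `join_key` column name, and the `SampleStddev` aggregate are all assumptions made for illustration. SQLite has no built-in STDDEV, so the sketch registers one that computes the sample standard deviation; this is assumed to be close to what BigQuery's STDDEV returns, but that is not verified here.

```python
import sqlite3
import statistics


class SampleStddev:
    """Hypothetical STDDEV aggregate for SQLite (sample standard deviation)."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Ignore NULLs, as SQL aggregates normally do.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # A sample standard deviation needs at least two values.
        return statistics.stdev(self.values) if len(self.values) > 1 else None


def find_outliers(rows, sigmas=4):
    """Count rows per state heavier than mean + sigmas * stddev.

    rows is a list of (state, weight_pounds) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("STDDEV", 1, SampleStddev)
    conn.execute("CREATE TABLE natality (state TEXT, weight_pounds REAL)")
    conn.executemany("INSERT INTO natality VALUES (?, ?)", rows)
    # Same shape as the answer's query: every row gets join_key = 1, and the
    # one-row threshold subquery also gets join_key = 1, so the join attaches
    # the threshold to every row.
    query = """
        SELECT n.state, COUNT(*) AS outlier_count
        FROM (SELECT state, weight_pounds, 1 AS join_key FROM natality) AS n
        JOIN (SELECT AVG(weight_pounds) + STDDEV(weight_pounds) * ? AS giant_baby,
                     1 AS join_key
              FROM natality) AS o
          ON n.join_key = o.join_key
        WHERE n.weight_pounds > o.giant_baby
          AND n.state != '' AND n.state IS NOT NULL
        GROUP BY n.state
        ORDER BY outlier_count DESC
    """
    result = conn.execute(query, (sigmas,)).fetchall()
    conn.close()
    return result


# Made-up sample: 40 ordinary weights plus one very heavy baby in MD.
sample_rows = (
    [("MD", 7.0)] * 10 + [("MD", 8.0)] * 10
    + [("NY", 7.0)] * 10 + [("NY", 8.0)] * 10
    + [("MD", 20.0)]
)
outliers = find_outliers(sample_rows)
```

With this made-up sample, only the 20-pound row exceeds the threshold, so `outliers` holds a single row for MD. The sample needs to be reasonably large: with only a handful of rows, no value can sit 4 sample standard deviations above the mean.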