Ignoring outliers in MySQL

Date: 2017-08-21 19:56:31

Tags: mysql sql stdev

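I have to present some data to colleagues, and I am having trouble analysing it in MySQL.

I have a table called 'payments'. Each payment has the following columns: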

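  1. Client (our customer, e.g. a bank)
  2. Amount_gbp (the GBP equivalent of the transaction value)
  3. Currency
  4. Origin_country
  5. Client_type (individual or company)

I have written very simple queries, such as: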

    SELECT
        AVG(amount_gbp),
        COUNT(client) AS '#Of Results'
    FROM payments
    WHERE client_type = 'individual'
        AND amount_gbp IS NOT NULL
        AND currency = 'TRY'
        AND country_origin = 'GB'
        AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
    

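But what I really need to do is eliminate outliers from the average AND/OR only include results that lie within a given number of standard deviations of the mean.

For example, ignore the top/bottom 10 results, or the top/bottom 2% of results, AND/OR ignore any results that fall more than 2 STDEVs from the mean.

Can anyone help?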

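1 answer:

Answer 0: (score: 0)

--- Edited answer - try it and let me know ---

Your best approach is to create a TEMPORARY table holding the avg and std_dev values and compare against them. Let me know if this is not feasible: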

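    CREATE TEMPORARY TABLE payment_stats AS
    SELECT
        AVG(p.amount_gbp)    AS avg_gbp,
        STDDEV(p.amount_gbp) AS std_gbp,
        -- Smallest of the N highest amounts; kept rows must be below it.
        (SELECT MIN(srt.amount_gbp)
         FROM (SELECT amount_gbp
               FROM payments
               -- same WHERE as the outer query, without the p. prefix
               WHERE client_type = 'individual'
                 AND amount_gbp IS NOT NULL
                 AND currency = 'TRY'
                 AND country_origin = 'GB'
                 AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
               ORDER BY amount_gbp DESC
               LIMIT 10  -- number of top results to ignore (10 as in the question; adjust)
              ) srt
        ) AS max_g,
        -- Largest of the N lowest amounts; kept rows must be above it.
        (SELECT MAX(srt.amount_gbp)
         FROM (SELECT amount_gbp
               FROM payments
               WHERE client_type = 'individual'
                 AND amount_gbp IS NOT NULL
                 AND currency = 'TRY'
                 AND country_origin = 'GB'
                 AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
               ORDER BY amount_gbp ASC
               LIMIT 10  -- number of bottom results to ignore (adjust)
              ) srt
        ) AS min_g
    FROM payments p
    WHERE p.client_type = 'individual'
      AND p.amount_gbp IS NOT NULL
      AND p.currency = 'TRY'
      AND p.country_origin = 'GB'
      AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1';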

Then you can compare against the temporary table:

    SELECT
        AVG(p.amount_gbp) AS avg_gbp,
        COUNT(p.client)   AS `#Of Results`
    FROM payments p
    -- Join the one-row stats table once: MySQL does not allow a TEMPORARY
    -- table to be referenced more than once in the same query.
    CROSS JOIN payment_stats s
    WHERE p.amount_gbp >= s.avg_gbp - s.std_gbp * 2
      AND p.amount_gbp <= s.avg_gbp + s.std_gbp * 2
      AND p.amount_gbp > s.min_g
      AND p.amount_gbp < s.max_g
      AND p.client_type = 'individual'
      AND p.amount_gbp IS NOT NULL
      AND p.currency = 'TRY'
      AND p.country_origin = 'GB'
      AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1';

    -- later
    DROP TEMPORARY TABLE payment_stats;

Note that I had to repeat the WHERE conditions. Also, change the *2 to whatever <factor> you need!
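If a temporary table is not an option, the same standard-deviation filter can probably be written as a single statement. The following is only a sketch under a few assumptions: `@factor` is a hypothetical user variable standing in for the *2 multiplier, `trimmed_avg_gbp` is an illustrative alias, and the stats are computed in a derived table `s` instead of `payment_stats`. Bear in mind that `STDDEV()` in MySQL is the population standard deviation; `STDDEV_SAMP()` gives the sample version if that is what you want.

```sql
-- Hypothetical user variable; 2 matches the "2 STDEVs" in the question.
SET @factor := 2;

SELECT
    AVG(p.amount_gbp) AS trimmed_avg_gbp,
    COUNT(p.client)   AS `#Of Results`
FROM payments p
CROSS JOIN (
    -- Mean and standard deviation over the same filtered rows.
    SELECT AVG(amount_gbp)    AS avg_gbp,
           STDDEV(amount_gbp) AS std_gbp
    FROM payments
    WHERE client_type = 'individual'
      AND amount_gbp IS NOT NULL
      AND currency = 'TRY'
      AND country_origin = 'GB'
      AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
) s
WHERE p.client_type = 'individual'
  AND p.amount_gbp IS NOT NULL
  AND p.currency = 'TRY'
  AND p.country_origin = 'GB'
  AND p.date_time BETWEEN '2017/1/1' AND '2017/9/1'
  -- Keep only rows within @factor standard deviations of the mean.
  AND p.amount_gbp BETWEEN s.avg_gbp - @factor * s.std_gbp
                       AND s.avg_gbp + @factor * s.std_gbp;
```

The WHERE conditions still have to be repeated, once for the stats and once for the rows being averaged, but there is nothing to drop afterwards.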

Still, phew!

Each comparison checks a different statistic.

Let me know if this works better.
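The question also mentions dropping a percentage (e.g. the top/bottom 2%) rather than a fixed number of rows. That is not covered above; one possible sketch, assuming MySQL 8.0 or later (window functions are not available in earlier versions), ranks the filtered rows with `PERCENT_RANK()` and keeps the middle band. The 0.02 and 0.98 cut-offs and the `ranked`, `pct` and `trimmed_avg_gbp` names are illustrative.

```sql
SELECT
    AVG(amount_gbp) AS trimmed_avg_gbp,
    COUNT(client)   AS `#Of Results`
FROM (
    SELECT client,
           amount_gbp,
           -- Relative position of each row, from 0 (lowest) to 1 (highest),
           -- computed over the rows that pass the WHERE clause.
           PERCENT_RANK() OVER (ORDER BY amount_gbp) AS pct
    FROM payments
    WHERE client_type = 'individual'
      AND amount_gbp IS NOT NULL
      AND currency = 'TRY'
      AND country_origin = 'GB'
      AND date_time BETWEEN '2017/1/1' AND '2017/9/1'
) ranked
-- Drop roughly the lowest 2% and the highest 2% of amounts.
WHERE pct BETWEEN 0.02 AND 0.98;
```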