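Excluding outliers when trying to get an average

Time: 2014-05-18 23:02:21

Tags: php sql cakephp math

I'm using CakePHP for a website that lets users record their substance use.

What I want is the average dose for a particular substance. I did this by setting up a virtual field: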
    $this->RecordDrugUnit->virtualFields['sum'] ='AVG(RecordDrugUnit.dose)';

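The problem is that if one user messes up and enters a wild value, such as 100000 grams of alcohol, it throws off the average. So I'd like to exclude outliers, or find some better way of computing the average.

Does anyone have any input on this?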

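2 answers:

Answer 0: (score: 1)

You can do this by excluding values that lie too far from the mean, measured in standard deviations; a value more than 3 standard deviations from the mean is commonly considered an "outlier". If you only want to exclude the very extreme values, increase the number the standard deviation is multiplied by. Here is a very simplified, unoptimized way of doing that, which you can use as a starting point for your virtual field: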

mysql> select * from test;
+------+
| a    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
| 1000 |
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
+------+
11 rows in set (0.00 sec)

mysql> select * from test where (ABS(test.a - (select avg(a) from test)) < 3*(select stddev(a) from test));
+------+
| a    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
+------+
10 rows in set (0.00 sec)
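
To get the average itself rather than the filtered rows, the same condition can go straight into an AVG(). A minimal sketch, assuming a throwaway copy of the sample data: the table name outlier_demo, the INT column type and the aliases mean and sd are my own choices, not part of the session above. The derived table computes the mean and standard deviation once instead of through two separate subqueries.

```sql
-- Throwaway copy of the sample data (table name and column type are assumptions).
CREATE TABLE outlier_demo (a INT);
INSERT INTO outlier_demo (a)
VALUES (1), (2), (3), (4), (5), (1000), (1), (2), (3), (4), (5);

-- Compute the mean and the population standard deviation once, then
-- average only the rows within 3 standard deviations of the mean.
SELECT AVG(t.a) AS avg_without_outliers
FROM outlier_demo AS t
CROSS JOIN (
    SELECT AVG(a) AS mean, STDDEV(a) AS sd
    FROM outlier_demo
) AS s
WHERE ABS(t.a - s.mean) < 3 * s.sd;
```

For this data the mean is about 93.64 and STDDEV() (the population standard deviation in MySQL) is about 286.62, so the cutoff is about 859.86. The value 1000 deviates by about 906.36 and is dropped, and the remaining ten values average to 3. Note that with fewer than 10 rows no value can ever be 3 population standard deviations away from the mean, so this filter only starts to bite on larger samples.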

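I believe that if a virtual field contains a SELECT, that SELECT simply gets run, so you can use it directly. My quick, untested attempt at the virtual field: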

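    // Untested. Assumes the model's table is record_drug_units (CakePHP's default name for this model).
    // The subqueries must select from the table itself; the alias rdu cannot be used as a table name.
    $this->RecordDrugUnit->virtualFields['sum'] = 'SELECT AVG(rdu.dose) FROM record_drug_units rdu WHERE ABS(rdu.dose - (SELECT AVG(dose) FROM record_drug_units)) < 3 * (SELECT STDDEV(dose) FROM record_drug_units)';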
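
The question asks for the average per substance, so the cutoff probably needs to be computed per substance as well. A rough sketch of that, under assumptions: the demo table name, the drug_id column and the sample rows are all made up for illustration, and the real schema may look different.

```sql
-- Hypothetical demo table; in the real app this would be the model's own table.
CREATE TABLE record_drug_units_demo (
    drug_id INT NOT NULL,
    dose    DECIMAL(10, 2) NOT NULL
);

-- Made-up rows: drug 1 has one wild entry, drug 2 has identical doses.
INSERT INTO record_drug_units_demo (drug_id, dose) VALUES
    (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 1000),
    (1, 1), (1, 2), (1, 3), (1, 4), (1, 5),
    (2, 10), (2, 10), (2, 10);

-- Per-substance mean and standard deviation, joined back to the rows so that
-- each dose is compared against the statistics of its own substance.
SELECT r.drug_id, AVG(r.dose) AS avg_dose
FROM record_drug_units_demo AS r
JOIN (
    SELECT drug_id, AVG(dose) AS mean, STDDEV(dose) AS sd
    FROM record_drug_units_demo
    GROUP BY drug_id
) AS s ON s.drug_id = r.drug_id
WHERE ABS(r.dose - s.mean) <= 3 * s.sd
GROUP BY r.drug_id;
```

The comparison uses <= rather than < so that a substance whose doses are all identical (standard deviation 0) is not filtered out entirely. With the made-up rows, the 1000 entry is dropped for drug 1, whose average becomes 3, and drug 2 keeps its average of 10.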

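Answer 1: (score: 0)

You can use a trimmed mean function, which sorts the values, discards a fixed percentage of them from each end, and averages what is left. If you are interested, I can add the procedure later today.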

-- Add trimmed mean user defined function

-- ------------------------------ START FUNCTION --------------------------------
drop function if exists trimmed_mean;

delimiter //

create function trimmed_mean(
    -- data: comma-separated list of numeric values, sorted from low to high
    -- (you can use GROUP_CONCAT(x ORDER BY x) to let MySQL generate such a list for you)
    data   text
    -- p: the percentage of the data points to trim (half from each end).
,   p      tinyint
)
returns double
deterministic
no sql
begin
    -- n: number of observations
    declare n int default 1 + length(data) - length(replace(data, ',', ''));
    -- m: number of observations to remove on both ends of the data set
    declare m int default n * p / 2 / 100;
    -- t: trimmed dataset
    declare t text default substring_index(substring_index(data, ',', n - m), ',', -(n - m - m));
    -- i: index of the data point currently being parsed
    declare i int default 1;
    -- v: running sum of the data points; DECIMAL keeps the sum exact for up to 10 decimals
    declare v decimal(65, 10) default 0;
    -- update n to the number of original observations minus the trimmed ones.
    set n = n - 2 * m;
    -- nothing to average: NULL input, or p so large that every value was trimmed
    if n is null or n < 1 then
        return null;
    end if;
    -- walk through the trimmed dataset, adding the i-th data point to the sum
    while i <= n do
        set v = v + cast(substring_index(substring_index(t, ',', i), ',', -1) as decimal(65, 10));
        set i = i + 1;
    end while;
    -- divide the sum by the number of remaining observations to get the mean.
    return v / n;
end
//

delimiter ;
-- ------------------------------- END FUNCTION ---------------------------------
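
Calling the function could look like this. A sketch under assumptions: it reuses the test table from the other answer, and the 20 percent trim and the session limit value are arbitrary picks.

```sql
-- GROUP_CONCAT output is capped by group_concat_max_len (1024 bytes by default),
-- so raise the limit for this session before building long lists.
SET SESSION group_concat_max_len = 1000000;

-- Sort the values, pass them as one comma-separated string, and trim 20%
-- in total, i.e. about 10% from each end.
SELECT trimmed_mean(GROUP_CONCAT(a ORDER BY a), 20) AS trimmed_avg
FROM test;
```

With the 11 sample rows this removes one value from each end (a 1 and the 1000) and averages the remaining nine, which is 29 / 9 or roughly 3.22. Unlike the standard deviation filter, a trimmed mean always drops values from both ends, even when nothing is wrong with them.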