我正在将cakephp用于允许用户记录其物质使用的网站。
我想要的是获得特定物质的平均剂量。我通过设置虚拟场
完成了这项工作 $this->RecordDrugUnit->virtualFields['sum'] ='AVG(RecordDrugUnit.dose)';
问题是,如果一个用户搞砸了并且有一个混乱的值,如100000克酒精,那么这将搞砸平均值。所以我想排除异常值,或以某种方式找出一种更好的方法来收集平均值。
有人对此有任何意见吗?
答案 0 :(得分:1)
您可以通过排除距标准偏差太远的变量来做到这一点,其中3 *标准差与平均值通常被认为是“异常值”。如果您只想将非常排除在平均值之外,则可以增加stddev乘以的数量。这是一个非常简化的,未经优化的方法,您可以将其用作您选择使用的虚拟文件的起点:
mysql> select * from test;
+------+
| a |
+------+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 1000 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+------+
11 rows in set (0.00 sec)
mysql> select * from test where (ABS(test.a - (select avg(a) from test)) < 3*(select stddev(a) from test));
+------+
| a |
+------+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+------+
10 rows in set (0.00 sec)
我相信如果虚拟字段包含一个选择,那么只需运行该选择,因此您可以直接使用它。我在虚拟领域的快速,未经测试的尝试:
$this->RecordDrugUnit->virtualFields['sum'] = 'select AVG(rdu.dose) from RecordDrugUnit rdu where (ABS(rdu.dose - (select avg(dose) from rdu)) < 3*(select stddev(dose) from rdu))';
答案 1 :(得分:0)
您可以使用trimmed mean功能。如果您有兴趣,我可以在今天晚些时候添加该程序。
-- Add trimmed mean user defined function
-- ------------------------------ START FUNCTION --------------------------------
drop function if exists trimmed_mean;
delimiter //
create function trimmed_mean(
-- data: comma separated list of numeric values, left-to-right sorted from low to high
-- (you can use GROUP_CONCAT(x ORDER BY x) to let MySQL generate suchs lists for you
data text
-- p: the percentage of the data points to trim.
, p tinyint
)
returns double
begin
-- n: number of observations
declare n int default 1 + length(data) - length(replace(data, ',', ''));
-- m: number of observations to remove on both ends of the data set
declare m int default n * p / 2 / 100;
-- t: trimmed dataset
declare t text default substring_index(substring_index(data, ',', n-m), ',', -(n-m-m));
-- current character (for parsing numbers out of the dataset)
declare c varchar(1);
-- x: integer part of the data point, y: decimal part of the data point
declare x, y varchar(32);
-- z: number of decimals
declare z int unsigned default 0;
-- number of characters in the (trimmed) data set
declare l int unsigned default length(t);
-- i: current position in the data set, j: marks start of data point
declare i, j int unsigned default 1;
-- the sum of the integer parts of the data points
declare v int default 0;
-- d: the sum of the decimal parts of the data parts (as scaled integer), s: scaling factor
declare d, s int unsigned default 0;
repeat
-- get the current character from the trimmed data set
set c = substring(t, i, 1);
-- check if current position is a data point separator (',') or end of data terminator ('')
if substring(t, i, 1) in (',', '') then
-- parse out a data point (from j up to i) into x; advance j and look for a decimal separator ('.')
set x = substring(t, j, i - j),
j = i + 1,
d = instr(x, '.')
;
-- if we have no decimals, then parse data point as integer and update our sum v with it.
if d = 0 then
set v = v + cast(x as signed);
else
-- we have decimals. Parse up to the decimal separator ('.') as integer and update our sum v.
-- parse out the part after the decimal separator into y. Update the total number of decimals to keep track of in z
-- Finally, pad our decimal parts sum with the number of decimals and prepend a 1 to not lose leading zeroes
set v = v + cast(substring_index(x, '.', 1) as signed)
, y = substring_index(x, '.', -1)
, z = greatest(z, length(y))
, d = cast(rpad(cast(d as char), '0', z + 1) as unsigned) + cast(rpad(concat('1', y), '0', z + 1) as unsigned)
;
end if;
end if;
-- advance position to look at next character in the dataset.
set i = i + 1;
-- stop scanning when we ran trhough the dataset.
until c = '' end repeat;
-- compute the scaling factor s (1 followed by number of zeroes equal to max number of decimals)
-- update n to the number of original observations minus the specified percentage.
set s = cast(rpad('1', '0', z + 1) as unsigned)
, n = (n - 2 * m)
;
-- add sum of integer parts to the (downscaled) sum of decimal parts, and divide to get the mean.
return (v + case d when 0 then 0 else (d - n * s) / s end) / n;
end
//
delimiter ;
-- ------------------------------- END FUNCTION ---------------------------------