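I'm looking for a way to calculate a useful average for a given set of values that may contain huge spikes (for example, 21, 54, 34, 14, 20, 300, 23 or 1, 1, 1, 1, 200, 1, 100). With a standard mean calculation, the spikes throw the result off.
I looked into using the median, but it doesn't really give the result I'm after.
I'd like to implement this in T-SQL.
Any ideas?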
Answer 0 (score: 1)
Use a median filter:
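-- Average only the values whose order of magnitude is close to the median.
-- @filter_level is the allowed distance from the median in log10 units.
-- LOG10 requires strictly positive values.
SELECT  AVG(m.value)
FROM    (
        -- Largest value of the lower half, used as the median
        SELECT  TOP 1 value AS median
        FROM    (
                SELECT  TOP 50 PERCENT value
                FROM    mytable
                ORDER BY
                        value
                ) lower_half
        ORDER BY
                value DESC
        ) q
JOIN    mytable m
ON      ABS(LOG10(m.value) - LOG10(q.median)) <= @filter_level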
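The query above leaves `mytable` and `@filter_level` undefined. Here is a minimal runnable sketch using the first sample set from the question. The table variable `@t`, the alias `lower_half`, and the filter level of 0.5 (values within a factor of about 3.16 of the median) are assumptions for illustration, not part of the original answer.

```sql
-- Assumed filter level: keep values within 10^0.5 (about 3.16x) of the median
DECLARE @filter_level FLOAT = 0.5;

-- Sample data taken from the question
DECLARE @t TABLE (value INT);
INSERT INTO @t (value) VALUES (21), (54), (34), (14), (20), (300), (23);

SELECT AVG(CAST(m.value AS FLOAT)) AS filtered_avg  -- cast avoids integer averaging
FROM (
    -- Largest value of the lower half, used as the median
    SELECT TOP 1 value AS median
    FROM (
        SELECT TOP 50 PERCENT value
        FROM @t
        ORDER BY value
    ) AS lower_half
    ORDER BY value DESC
) AS q
JOIN @t AS m
    ON ABS(LOG10(m.value) - LOG10(q.median)) <= @filter_level;
```

With this data the median is 23, so 300 is filtered out and the remaining six values average to roughly 27.67. A larger `@filter_level` keeps more of the spikes; a smaller one discards more aggressively.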
Answer 1 (score: 1)
This way you discard the highest and lowest 25% of the values before calculating the result.
declare @t table (col1 int)
insert @t
select 21 union all
select 54 union all
select 34 union all
select 14 union all
select 20 union all
select 300 union all
select 23 union all
select 1 union all
select 1 union all
select 1 union all
select 1 union all
select 200 union all
select 1 union all
select 100
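-- Inner query keeps the lowest 75% (drops the top 25%).
-- Middle query keeps the highest 67% of those (drops roughly the bottom 25% of the original rows).
select avg(col1) from (
    select top 67 percent col1 from (
        select top 75 percent col1 from @t order by col1
    ) a order by col1 desc
) b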
Answer 2 (score: 0)
GROUP BY
(for example, so that the numbers differ by no more than a factor of 10, or any other log base) HAVING
Answer 3 (score: 0)
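The danger here is that you can't be sure all of those spikes are insignificant and worth discarding. One person's noise is another person's black swan.
If you're worried that large values will unduly skew your view of the data, you're better off using a measure that is less sensitive to outliers, such as the median. It is harder to compute than the mean, but it gives you a measure of central tendency that isn't affected by spikes.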
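For comparison, here is a small sketch of computing the median next to the mean in T-SQL. It assumes SQL Server 2012 or later, where `PERCENTILE_CONT` is available; the table variable `@t` and its contents (the first sample set from the question) are for illustration only.

```sql
-- Sample data taken from the question
DECLARE @t TABLE (val INT);
INSERT INTO @t (val) VALUES (21), (54), (34), (14), (20), (300), (23);

-- PERCENTILE_CONT is an analytic function here, so it returns the same
-- value on every row; DISTINCT collapses the result to a single row.
SELECT DISTINCT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) OVER () AS median_val,
    AVG(CAST(val AS FLOAT)) OVER () AS mean_val
FROM @t;
```

For these seven values the median is 23, while the mean is pulled up to about 66.57 by the single 300.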
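Answer 4 (score: 0)
You might consider using window functions such as OVER / PARTITION BY. These let you fine-tune the exclusion within specific groups of rows (for example by name, date, or hour). In this example I borrow the rows from t-clausen.dk's example and extend them with a name column so that we can demonstrate windowing.
-- Set boundaries, like the TOP PERCENT used in the aforementioned example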
DECLARE @UBOUND FLOAT, @LBOUND FLOAT
SET @UBOUND = 0.8 --(80%)
SET @LBOUND = 0.2 --(20%)
--Build a CTE table
;WITH tb_example AS (
select [Val]=21,[fname]='Bill' union all
select 54,'Tom' union all
select 34,'Tom' union all
select 14,'Bill' union all
select 20,'Bill' union all
select 300,'Tom' union all
select 23,'Bill' union all
select 1,'Tom' union all
select 1,'Tom' union all
select 1,'Bill' union all
select 1,'Tom' union all
select 200,'Bill' union all
select 1,'Tom' union all
select 12,'Tom' union all
select 8,'Tom' union all
select 11,'Bill' union all
select 100,'Bill'
)
--Outer query applies criteria of your choice to remove spikes
SELECT fname,AVG(Val) FROM (
-- Inner query applies windowed aggregate values for outer query processing
SELECT *
,ROW_NUMBER() OVER (PARTITION BY fname order by Val) RowNum
,COUNT(*) OVER (PARTITION BY fname) RowCnt
,MAX(Val) OVER (PARTITION BY fname) MaxVal
,MIN(Val) OVER (PARTITION BY fname) MinVal
FROM tb_example
) TB
WHERE
-- You can use the bounds to eliminate the top and bottom 20%
RowNum BETWEEN (RowCnt*@LBOUND) and (RowCnt*@UBOUND) -- Limits window
-- Or you may choose to simply eliminate the MAX and MIN values.
-- Because the two tests are joined with OR, a row is kept if it passes either one.
OR (Val > MinVal AND Val < MaxVal) -- Removes lowest and highest values
GROUP BY fname
In this case I use both criteria and average Val by fname. But the sky is the limit when it comes to how you choose to mitigate spikes with this technique.
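As one variation on the same idea, the two criteria can be combined with AND so that a row has to pass both tests. The sketch below is an assumption about how that might look, not the answer author's query; it uses only the 'Bill' rows from the example and casts to FLOAT to avoid integer averaging.

```sql
DECLARE @UBOUND FLOAT = 0.8, @LBOUND FLOAT = 0.2;

;WITH tb_example AS (
    -- The 'Bill' rows from the example above
    SELECT Val, fname
    FROM (VALUES (21, 'Bill'), (14, 'Bill'), (20, 'Bill'), (23, 'Bill'),
                 (1, 'Bill'), (200, 'Bill'), (11, 'Bill'), (100, 'Bill')
         ) AS v (Val, fname)
)
SELECT fname, AVG(CAST(Val AS FLOAT)) AS trimmed_avg
FROM (
    SELECT Val, fname
        ,ROW_NUMBER() OVER (PARTITION BY fname ORDER BY Val) AS RowNum
        ,COUNT(*) OVER (PARTITION BY fname) AS RowCnt
        ,MAX(Val) OVER (PARTITION BY fname) AS MaxVal
        ,MIN(Val) OVER (PARTITION BY fname) AS MinVal
    FROM tb_example
) AS TB
WHERE RowNum BETWEEN (RowCnt * @LBOUND) AND (RowCnt * @UBOUND) -- middle 60% by position
  AND Val > MinVal AND Val < MaxVal                            -- and not the min or max
GROUP BY fname;
```

With eight rows the position test keeps row numbers 2 through 6 (11, 14, 20, 21, 23), which averages to 17.8, whereas the OR form also lets 100 through.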