Outlier function in R

Date: 2018-03-21 11:28:45

Tags: r function boxplot outliers

I have a function, detectaOutliers(), that is meant to remove outliers, but it does not remove all of them.

Can someone help me find the mistake?

detectaOutliers = function(x) {
  q = quantile(x, probs = c(0.25, 0.75))
  R = IQR(x)
  OM1 = q[1] - (R * 1.5)  # fences for moderate outliers
  OM3 = q[2] + (R * 1.5)
  OE1 = q[1] - (R * 3)    # fences for extreme outliers
  OE3 = q[2] + (R * 3)

  moderados = ifelse(x < OM1 | x > OM3, 1, 0)  
  extremos  = ifelse(x < OE1 | x > OE3, 1, 0)  
  cbind(extOut = moderados)
}


cepas = unique(AbsExtSin$Cepa)
concs = unique(AbsExtSin$Concen)
outliers = NULL
for (cepa in cepas) {
    for (concen in concs) {
      datosOE = subset(AbsExtSin, Cepa == cepa & Concen == concen)
      outs = detectaOutliers(datosOE$Abs)

      datosOE  = cbind(datosOE, outs)
      outliers = rbind(outliers, datosOE)
    }
}
AbsExtSin = subset(outliers, extOut == 0)[, 1:5]

This is the data after removing outliers (I removed 11 outliers, but the boxplot still shows more).

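2 answers:

Answer 0: (score: 2)

Answer: I assume your problem is the following: you first detect outliers (the same way the boxplot function does) and remove them. You then draw a boxplot of the cleaned data, and it shows outliers again, although you expected to see none.

This is not necessarily a bug in your code; it is a mistaken expectation. When you remove outliers, the statistics of the data set change. The quartiles, for example, are no longer the same, so the fences move and points that were inside them before can now be flagged as "new" outliers. See the following example: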

## create example data
set.seed(12345)
rand <- rexp(100,23)
## plot. gives outliers.
boxplot(rand)
## detect outliers with these functions
detectaOutliers = function(x) {
  q = quantile(x, probs = c(0.25, 0.75))
  R = IQR(x)
  OM1 = q[1] - (R * 1.5)  # fences for moderate outliers
  OM3 = q[2] + (R * 1.5)
  OE1 = q[1] - (R * 3)    # fences for extreme outliers
  OE3 = q[2] + (R * 3)

  moderados = ifelse(x < OM1 | x > OM3, 1, 0)  
  extremos  = ifelse(x < OE1 | x > OE3, 1, 0)  
  cbind(extOut = moderados)
}
detectOut <- function(x) boxplot(x, plot = FALSE)$out
## clean your data
clean1 <- rand[!as.logical(detectaOutliers(rand))]
clean2 <- rand[!rand%in%detectOut(rand)]
## check whether both approaches keep exactly the same values
## (identical() also copes with results of different length)
identical(clean1, clean2)
# Fun fact: depending on your data, clean1 and clean2
# are not always the same. See the extra note below.
## plot cleaned data
boxplot(clean2)
## Still has outliers, but "new" ones. Confirm with:
sort(boxplot(rand)$out)   # original outliers
sort(boxplot(clean2)$out) # new outliers
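
If the goal really is a data set whose boxplot shows no points beyond the whiskers, the removal step has to be repeated until nothing is flagged any more. The following is only a minimal sketch of that idea: removeUntilClean is a hypothetical helper name, not part of the question's code, and it uses boxplot.stats() for detection. Repeated trimming can discard a large share of a skewed sample, so treat it as an illustration rather than a recommendation.

```r
## Hypothetical helper: repeat boxplot-style trimming until
## boxplot.stats() no longer reports any outlier.
removeUntilClean <- function(x, coef = 1.5) {
  repeat {
    out <- boxplot.stats(x, coef = coef)$out
    if (length(out) == 0) return(x)   # nothing flagged: done
    x <- x[!x %in% out]               # drop flagged values, then re-check
  }
}

set.seed(12345)
rand  <- rexp(100, 23)
clean <- removeUntilClean(rand)

length(boxplot.stats(clean)$out)  # no outliers left, by construction
length(rand) - length(clean)      # how many points were dropped in total
```

Each pass removes at least one value, so the loop always terminates.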

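Note 1: Your code does not necessarily identify outliers the same way as the boxplot function in R (I am not sure about ggplot's boxplot, but it is at least true for the graphics::boxplot function):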

## The boxplot function (rather: boxplot.stats)
## does not use the quantile function, but the fivenum function
## to identify outliers. They produce different results, e.g., here:
fivenum(rand)[c(2,4)]
quantile(rand,probs=c(0.25,0.75))
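
To get flags that agree with what graphics::boxplot draws, the fences can be built from the hinges returned by fivenum() instead of from quantile(). This is a sketch under that assumption; detectaOutliersHinges is a hypothetical name chosen to mirror the function in the question.

```r
## Hypothetical variant of detectaOutliers() that uses the hinges
## from fivenum(), as boxplot.stats() does, instead of quantile().
detectaOutliersHinges <- function(x, coef = 1.5) {
  h <- fivenum(x)[c(2, 4)]   # lower and upper hinge
  r <- diff(h)               # hinge spread
  as.integer(x < h[1] - coef * r | x > h[2] + coef * r)
}

set.seed(12345)
rand <- rexp(100, 23)
flag <- detectaOutliersHinges(rand)

## The flagged values should be the ones boxplot.stats() reports.
stopifnot(identical(rand[flag == 1], boxplot.stats(rand)$out))
```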

Note 2: If you want a boxplot that does not draw the outliers, you can use the outline argument of the boxplot function (for ggplot, see "Ignore outliers in ggplot2 boxplot").
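
A small illustration of the outline argument, using made-up data with one obvious outlier. Note that outline = FALSE only affects the drawing: the values are still in the data and are still listed in the object that boxplot returns.

```r
## Made-up data: 1..20 plus one value far above the upper fence.
x <- c(1:20, 100)

## Draw the boxplot without outlier points.
b <- boxplot(x, outline = FALSE)

## The outlier is hidden in the plot but still reported here.
b$out
```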

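Answer 1: (score: 0)

After 6 hours I realized that the error was in the variables I was using. My data set has 4 variables, and I need to remove the outliers of one column separately for each combination of two other variables; it turned out that I had picked the wrong two. In the end, I realized that the function works perfectly!

Sorry for the inconvenience, and many thanks to everyone.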