I have a function, detectaOutliers(), that is meant to remove outliers, but it does not remove all of them.
Can someone help me find the error?
detectaOutliers = function(x) {
  q = quantile(x, probs = c(0.25, 0.75))
  R = IQR(x)
  OM1 = q[1] - (R * 1.5) # moderate outliers
  OM3 = q[2] + (R * 1.5)
  OE1 = q[1] - (R * 3) # extreme outliers
  OE3 = q[2] + (R * 3)
  moderados = ifelse(x < OM1 | x > OM3, 1, 0)
  extremos = ifelse(x < OE1 | x > OE3, 1, 0)
  cbind(extOut = moderados)
}
cepas = unique(AbsExtSin$Cepa)
concs = unique(AbsExtSin$Concen)
outliers = NULL
for (cepa in cepas) {
  for (concen in concs) {
    datosOE = subset(AbsExtSin, Cepa == cepa & Concen == concen)
    outs = detectaOutliers(datosOE$Abs)
    datosOE = cbind(datosOE, outs)
    outliers = rbind(outliers, datosOE)
  }
}
AbsExtSin = subset(outliers, extOut == 0)[, 1:5]
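Answer 0 (score: 2)
Answer: I assume your problem is the following: first, you detect outliers (just as the boxplot function does) and remove them. Then you draw a boxplot from the cleaned data, and it shows outliers again, whereas you expected to see none.
This is not necessarily an error in your code; it is an error in your expectations. When you remove outliers, the statistics of the dataset change. For example, the quartiles are no longer the same. As a result, you may identify "new" outliers. See the following example: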
## create example data
set.seed(12345)
rand <- rexp(100,23)
## plot. gives outliers.
boxplot(rand)
## detect outliers with these functions
detectaOutliers = function(x) {
  q = quantile(x, probs = c(0.25, 0.75))
  R = IQR(x)
  OM1 = q[1] - (R * 1.5) # moderate outliers
  OM3 = q[2] + (R * 1.5)
  OE1 = q[1] - (R * 3) # extreme outliers
  OE3 = q[2] + (R * 3)
  moderados = ifelse(x < OM1 | x > OM3, 1, 0)
  extremos = ifelse(x < OE1 | x > OE3, 1, 0)
  cbind(extOut = moderados)
}
detectOut <- function(x) boxplot(x, plot = FALSE)$out
## clean your data
clean1 <- rand[!as.logical(detectaOutliers(rand))]
clean2 <- rand[!rand%in%detectOut(rand)]
## check that these functions do the same.
all(clean1 == clean2 )
# Fun fact: depending on your data, clean1 and clean2
# are not always the same. See the extra note below.
## plot cleaned data
boxplot(clean2)
## Still has outliers. But "new" ones. confirm with:
sort(boxplot(rand)$out) # original outlier
sort(boxplot(clean2)$out) # new outlier
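To make the "new outliers" effect concrete, here is a minimal sketch that re-applies the boxplot rule until nothing is flagged and counts how many passes that takes. The helper name strip_outliers_repeatedly and its max_passes argument are made up for this illustration; it relies only on boxplot.stats(), which is what boxplot() uses internally.

```r
# Hypothetical helper (not from the question): re-apply the boxplot rule
# until no point is flagged, and report how many passes removed something.
strip_outliers_repeatedly <- function(x, max_passes = length(x)) {
  passes <- 0
  repeat {
    out <- boxplot.stats(x)$out          # points beyond the 1.5 * IQR fences
    if (length(out) == 0 || passes >= max_passes) break
    x <- x[!x %in% out]                  # drop them; the hinges shift as a result
    passes <- passes + 1
  }
  list(values = x, passes = passes)
}

set.seed(12345)
rand <- rexp(100, 23)
res <- strip_outliers_repeatedly(rand)

res$passes                               # passes that actually removed points
length(rand) - length(res$values)        # total number of points discarded
length(boxplot.stats(res$values)$out)    # 0: nothing left to flag
```

This is only meant to show why a single pass is not enough to get an outlier-free boxplot. Whether repeated trimming is appropriate for your data is a separate question, since every pass throws away more observations.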
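Note 1: Your code does not necessarily use the same outlier identification as the boxplot function in R (I'm not sure about the ggplot boxplot, but it is true at least for the graphics::boxplot function):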
## The boxplot function (rather: boxplot.stats)
## does not use the quantile function, but the fivenum function
## to identify outliers. They produce different results, e.g., here:
fivenum(rand)[c(2,4)]
quantile(rand,probs=c(0.25,0.75))
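As a small illustration of the difference, the sketch below compares hinges and quartiles on a tiny vector and then builds a hinge-based flagging function. The name flag_by_hinges is hypothetical; it is written to mirror the rule in boxplot.stats() (fences at coef times the hinge spread), assuming the input has no missing values.

```r
# Hinges (fivenum) versus default quartiles (quantile, type 7) on 1:10
x <- 1:10
fivenum(x)[c(2, 4)]                      # 3 and 8
quantile(x, probs = c(0.25, 0.75))       # 3.25 and 7.75

# Hypothetical helper: flag points with the same hinge-based fences that
# boxplot.stats() uses. Assumes x contains no NA values.
flag_by_hinges <- function(x, coef = 1.5) {
  h <- fivenum(x)[c(2, 4)]               # lower and upper hinge
  spread <- diff(h)                      # hinge spread, used instead of IQR()
  x < (h[1] - coef * spread) | x > (h[2] + coef * spread)
}

set.seed(12345)
rand <- rexp(100, 23)
flagged <- rand[flag_by_hinges(rand)]
identical(flagged, boxplot.stats(rand)$out)
```

If you want your own detector to agree with what boxplot() draws, basing the fences on fivenum() rather than quantile() is one way to get there.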
Note 2:
If you want a boxplot that does not show outliers, you can use the outline argument of the boxplot function (for ggplot, see "Ignore outliers in ggplot2 boxplot").
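A minimal sketch of the outline argument, reusing the example data from above. Note that outline = FALSE only stops the points from being drawn; the flagged values are still part of the data and are still returned by boxplot().

```r
set.seed(12345)
rand <- rexp(100, 23)

# Draw the box and whiskers only; outlying points are not plotted
b <- boxplot(rand, outline = FALSE)

# The outliers are still computed and returned, just not drawn
length(b$out)
```

This changes the picture, not the data, so any later statistics are unaffected.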
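Answer 1 (score: 0)
Six hours later, I realized that the error was in the variables I was using. My dataset has 4 variables, and I need to remove the outliers of one column separately for each combination of two other variables; it turned out I had picked the wrong two. In the end, I realized that the function works perfectly!
Sorry for the inconvenience, and many thanks to everyone.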