Putting boxplot outliers into a table

Time: 2021-03-09 15:37:43

Tags: r outliers

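I would like to know how to take the outliers from boxplot()$out (which returns the outlying values in the data) and put them into a table that shows which class each one belongs to, for example whether an outlier comes from the "van", "bus", or "saab" class.

I tried using which(), but that only returns the indices of the outliers, not their classes, and I don't know how to get the result into a table.

Any help would be greatly appreciated!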

library(reshape2)
vehData <-
  structure(
    list(
      Samples = 1:6,
      Comp = c(95L, 91L, 104L, 93L, 85L,
               107L),
      Circ = c(48L, 41L, 50L, 41L, 44L, 57L),
      D.Circ = c(83L,
                 84L, 106L, 82L, 70L, 106L),
      Rad.Ra = c(178L, 141L, 209L, 159L,
                 205L, 172L),
      Pr.Axis.Ra = c(72L, 57L, 66L, 63L, 103L, 50L),
      Max.L.Ra = c(10L,
                   9L, 10L, 9L, 52L, 6L),
      Scat.Ra = c(162L, 149L, 207L, 144L, 149L,
                  255L),
      Elong = c(42L, 45L, 32L, 46L, 45L, 26L),
      Pr.Axis.Rect = c(20L,
                       19L, 23L, 19L, 19L, 28L),
      Max.L.Rect = c(159L, 143L, 158L, 143L,
                     144L, 169L),
      Sc.Var.Maxis = c(176L, 170L, 223L, 160L, 241L, 280L),
      Sc.Var.maxis = c(379L, 330L, 635L, 309L, 325L, 957L),
      Ra.Gyr = c(184L,
                 158L, 220L, 127L, 188L, 264L),
      Skew.Maxis = c(70L, 72L, 73L,
                     63L, 127L, 85L),
      Skew.maxis = c(6L, 9L, 14L, 6L, 9L, 5L),
      Kurt.maxis = c(16L,
                     14L, 9L, 10L, 11L, 9L),
      Kurt.Maxis = c(187L, 189L, 188L, 199L,
                     180L, 181L),
      Holl.Ra = c(197L, 199L, 196L, 207L, 183L, 183L),
      Class = c("van", "van", "saab", "van", "bus", "bus")
    ),
    row.names = c(NA,
                  6L), class = "data.frame")

#Remove outliers 
removeOutliers <- function(data) {
  OutVals <- boxplot(data)$out
  remOutliers <- sapply(data, function(x) x[!x %in% OutVals])
  return (remOutliers)
}
 
vehClass <- vehData$Class                # save the class labels before dropping that column
vehDataRemove1 <- vehData[, -1]          # drop Samples
vehDataRemove2 <- vehDataRemove1[, -19]  # drop Class
vehData <- vehDataRemove2

boxplot(vehData)
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData)
removeOutliers2 <- removeOutliers(removeOutliers1)

1 answer:

Answer 0 (score: 1)

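This can be simplified. Start from your original data frame vehData (all 20 columns, as created by structure(), before any columns are dropped). First get the row numbers of the outliers. In my comment I accidentally left out the seq() function: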

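# Drop Samples (column 1) and Class (column 20) so only the measurements remain
vehDataRemove <- vehData[, -c(1, 20)]
# boxplot() invisibly returns its statistics: $out holds the outlier values,
# $group the index of the column each outlier came from
OutVals <- boxplot(vehDataRemove)
# For each outlier, find the row that holds that value in its column
idx <- sapply(seq_along(OutVals$out), function(x) which(vehDataRemove[, OutVals$group[x]] == OutVals$out[x]))
idx
# [1] 5 5 6 5 3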
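
For reference, here is a minimal sketch of the rule behind $out, assuming boxplot()'s default range = 1.5: a value is flagged when it lies more than 1.5 times the hinge spread beyond the box. It uses only the Max.L.Ra values from the question; the names x, s, hinges and fences are illustrative.

```r
# Max.L.Ra values from the question's data
x <- c(10, 9, 10, 9, 52, 6)

s <- boxplot.stats(x)        # the statistics boxplot() uses, without drawing a plot
hinges <- s$stats[c(2, 4)]   # lower and upper hinge (the edges of the box)
fences <- hinges + c(-1.5, 1.5) * diff(hinges)  # limits beyond which a point is an outlier

fences               # 7.5 11.5
s$out                # 52 6
which(x %in% s$out)  # 5 6
```

Rows 5 and 6 are exactly the rows that idx reports for this column.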

Note that three of the five outliers in idx are in row 5. Now remove the rows that contain outliers:

NoOuts <- vehDataRemove[-unique(idx), ]
NoOuts
#   Comp Circ D.Circ Rad.Ra Pr.Axis.Ra Max.L.Ra Scat.Ra Elong Pr.Axis.Rect Max.L.Rect Sc.Var.Maxis Sc.Var.maxis Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis Kurt.Maxis Holl.Ra
# 1   95   48     83    178         72       10     162    42           20        159          176          379    184         70          6         16        187     197
# 2   91   41     84    141         57        9     149    45           19        143          170          330    158         72          9         14        189     199
# 4   93   41     82    159         63        9     144    46           19        143          160          309    127         63          6         10        199     207

So you lose half of your data! Alternatively, set the outliers to missing values:

Outs2NA <- vehDataRemove
Outs2NA[cbind(idx, OutVals$group)] <- NA
Outs2NA
#   Comp Circ D.Circ Rad.Ra Pr.Axis.Ra Max.L.Ra Scat.Ra Elong Pr.Axis.Rect Max.L.Rect Sc.Var.Maxis Sc.Var.maxis Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis Kurt.Maxis Holl.Ra
# 1   95   48     83    178         72       10     162    42           20        159          176          379    184         70          6         16        187     197
# 2   91   41     84    141         57        9     149    45           19        143          170          330    158         72          9         14        189     199
# 3  104   50    106    209         66       10     207    32           23        158          223          635    220         73         NA          9        188     196
# 4   93   41     82    159         63        9     144    46           19        143          160          309    127         63          6         10        199     207
# 5   85   44     70    205         NA       NA     149    45           19        144          241          325    188         NA          9         11        180     183
# 6  107   57    106    172         50       NA     255    26           28        169          280          957    264         85          5          9        181     183
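
Neither step above builds the table asked for in the question, so here is a rough sketch of one way to do it. To keep it short and self-contained it uses only the four columns of the question's data that contain outliers, plus Class. The names veh, num, rows and outTable are illustrative, and boxplot.stats() is swapped in for boxplot() so that no plot is drawn. Using %in% per column also copes with an outlier value that occurs in more than one row, a case where the sapply() lookup above would return a list.

```r
# Subset of the question's data: the columns with outliers, plus Class
veh <- data.frame(
  Pr.Axis.Ra = c(72, 57, 66, 63, 103, 50),
  Max.L.Ra   = c(10, 9, 10, 9, 52, 6),
  Skew.Maxis = c(70, 72, 73, 63, 127, 85),
  Skew.maxis = c(6, 9, 14, 6, 9, 5),
  Class      = c("van", "van", "saab", "van", "bus", "bus"),
  stringsAsFactors = FALSE
)
num <- veh[, names(veh) != "Class"]

# For each numeric column, the rows that the boxplot rule flags as outliers
rows <- lapply(num, function(x) which(x %in% boxplot.stats(x)$out))

# One row per outlier: variable, row number, value and class
outTable <- data.frame(
  Variable = rep(names(rows), lengths(rows)),
  Row      = unlist(rows, use.names = FALSE),
  stringsAsFactors = FALSE
)
# Look up each value by (row, column) position in the numeric matrix
outTable$Value <- as.matrix(num)[cbind(outTable$Row, match(outTable$Variable, names(num)))]
outTable$Class <- veh$Class[outTable$Row]
outTable

# Number of outliers per class, including classes that have none
table(factor(outTable$Class, levels = unique(veh$Class)))
```

With this data the table has five rows: four outliers from the bus class (rows 5 and 6) and one from saab (row 3), and none from van.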