Flagging outliers in a dataset in R with an additional column

Time: 2018-12-31 07:58:33

Tags: r outliers

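I have a dataset and want to create an additional column that flags outliers (values more than 1.5 times the IQR beyond the quartiles). I am currently using this code: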

    # Add an additional column flagging outliers beyond 1.5 * interquartile range
    plotdata$OUTLIERFLAG <- 0
    # Cycle through variables
    for (i in 1:length(unique(plotdata$variable))) {
      pms <- unique(plotdata$variable)[i]
      dats <- subset(plotdata, plotdata$variable == pms)
      # Cycle through sampling locations
      for (bore in unique(plotdata$Sample.Point)) {
        subdats <- dats[dats$Sample.Point == bore, ]
        x1 <- match(boxplot.stats(subdats$value2)$out, subdats$value2)
        ifelse(x1 == 0, NULL, plotdata[rownames(subdats[x1, ]), ]$OUTLIERFLAG <- 1)
      }
    }

However, the code sometimes does not work correctly. For identical values, one gets flagged as an outlier while the other does not. Please help.

1 Answer:

Answer 0 (score: 1)

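Since you did not provide any data, I will use the mtcars dataset. You may want to define outliers as data points above Q3 + IQR * 1.5. Also, for basic R operations, for loops are usually best avoided.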

    df <- mtcars[, c(2, 4)]
    df$outliers <- ifelse(test = df$hp > quantile(df$hp, probs = 0.75) + IQR(df$hp) * 1.5, yes = "FLAG", no = NA)
    df

    > df
                        cyl  hp outliers
    Mazda RX4             6 110     <NA>
    Mazda RX4 Wag         6 110     <NA>
    Datsun 710            4  93     <NA>
    Hornet 4 Drive        6 110     <NA>
    Hornet Sportabout     8 175     <NA>
    Valiant               6 105     <NA>
    Duster 360            8 245     <NA>
    Merc 240D             4  62     <NA>
    Merc 230              4  95     <NA>
    Merc 280              6 123     <NA>
    Merc 280C             6 123     <NA>
    Merc 450SE            8 180     <NA>
    Merc 450SL            8 180     <NA>
    Merc 450SLC           8 180     <NA>
    Cadillac Fleetwood    8 205     <NA>
    Lincoln Continental   8 215     <NA>
    Chrysler Imperial     8 230     <NA>
    Fiat 128              4  66     <NA>
    Honda Civic           4  52     <NA>
    Toyota Corolla        4  65     <NA>
    Toyota Corona         4  97     <NA>
    Dodge Challenger      8 150     <NA>
    AMC Javelin           8 150     <NA>
    Camaro Z28            8 245     <NA>
    Pontiac Firebird      8 175     <NA>
    Fiat X1-9             4  66     <NA>
    Porsche 914-2         4  91     <NA>
    Lotus Europa          4 113     <NA>
    Ford Pantera L        8 264     <NA>
    Ferrari Dino          6 175     <NA>
    Maserati Bora         8 335     FLAG
    Volvo 142E            4 109     <NA>

Only the Maserati Bora (8 cylinders, 335 hp) is flagged. A box-and-whisker plot showing the outlying data point:

    boxplot(df$hp, horizontal = TRUE)
    # Vertical line indicating the outlier limit
    abline(v = quantile(df$hp, probs = 0.75) + IQR(df$hp) * 1.5, col = "red")  # 305.25
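
As for why identical values end up flagged differently in the question's loop, a likely cause is `match()`: it returns only the position of the first match for each element, so tied outliers all point at the same row. The sketch below uses made-up toy values (not the asker's data) to show the difference from `%in%`. Note also that `match()` returns `NA`, not 0, when nothing matches, so the `x1 == 0` test never does what it appears to.

```r
# Toy data (hypothetical): the extreme value 100 appears twice.
vals <- c(10, 11, 12, 13, 14, 15, 16, 17, 100, 100)

# Values beyond the 1.5 * IQR whiskers, as used in the question.
out <- boxplot.stats(vals)$out

# match() gives the FIRST position of each outlier value, so both
# 100s map to index 9 and the row at index 10 is never flagged.
first_hits <- match(out, vals)

# %in% tests every element, so tied values are all flagged.
flag <- as.integer(vals %in% out)
```

Here `first_hits` contains index 9 twice, while `flag` marks both positions 9 and 10.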

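
To apply the same idea per variable and per sampling location without explicit loops, one option is `ave()`, which runs a function within each group and returns the results in the original row order. This is only a sketch: the data frame below is a made-up stand-in that reuses the question's column names (`variable`, `Sample.Point`, `value2`), and the helper `is_outlier()` is a hypothetical name. It checks both fences, since `boxplot.stats()` also reports points below Q1 - 1.5 * IQR. The `quantile()`-based fences can differ slightly from the hinges `boxplot.stats()` uses.

```r
# Hypothetical stand-in for plotdata, reusing the question's column names.
plotdata <- data.frame(
  variable     = rep(c("A", "B"), each = 12),
  Sample.Point = rep(rep(c("BH1", "BH2"), each = 6), times = 2),
  value2       = c(1, 2, 3, 4, 5, 50,          # A / BH1: 50 is far above the rest
                   50, 51, 52, 53, 54, 55,     # A / BH2: 50 is ordinary here
                   10, 11, -20, 12, 13, 14,    # B / BH1: -20 is far below the rest
                   7, 8, 9, 10, 11, 12)        # B / BH2: no outliers
)

# Helper (hypothetical name): 1 if a value lies beyond either 1.5 * IQR fence.
is_outlier <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  as.integer(x < q[1] - fence | x > q[2] + fence)
}

# ave() applies is_outlier() within each variable / Sample.Point combination.
plotdata$OUTLIERFLAG <- ave(plotdata$value2,
                            plotdata$variable, plotdata$Sample.Point,
                            FUN = is_outlier)

flagged_rows <- which(plotdata$OUTLIERFLAG == 1)
```

With this toy data, rows 6 and 15 are flagged: the value 50 counts as an outlier at BH1 but not at BH2, because the fences are computed within each group.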