使用R使用NA替换异常值的功能

时间:2019-06-05 00:10:14

标签: r outliers

假设我有以下示例数据集:

set.seed(20130828)
    data <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), 
    Y = c(runif(1000), 
    rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))

使用下一个函数,我确定了离群值,它们是3 sd以外的观测值:

findOutlier <- function(data, cutoff = 3) {
    sds <- apply(data, 2, sd, na.rm = TRUE)
    result <- mapply(function(d, s) {
        which(d > cutoff * s)
    }, data, sds)
    result
}

outliers <- findOutlier(data)

现在,我需要用NA替换所有异常值。我使用了以下功能:

OutliersToNA <- function(data, outliers) {
result <- mapply(function(d, o) {
    res <- d
    res[o] <- NA
    return(res)
}, data, outliers)
return(as.data.frame(result))

}

它返回以下错误:

 Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 4, 0, 1, 2, 3 

您能否建议对功能进行任何改进以用NA代替异常值?

1 个答案:

答案 0 :(得分:1)

我认为有一种更简单的方法可以做到这一点。您可以将查找和替换离群值组合在一个函数中,然后只需使用它来更新数据框列。让我知道这是否适合您-

library(dplyr) # for mutate_all()

# summary of input data
summary(data)

       X                    Y                  Z          
 Min.   :-19.774100   Min.   :0.000264   Min.   :-2.2037  
 1st Qu.: -0.716794   1st Qu.:0.235934   1st Qu.: 0.4144  
 Median :  0.007454   Median :0.484328   Median : 1.0390  
 Mean   : -0.027200   Mean   :0.516226   Mean   : 1.0428  
 3rd Qu.:  0.702163   3rd Qu.:0.749178   3rd Qu.: 1.6435  
 Max.   : 15.758520   Max.   :4.346755   Max.   : 4.4933  
 NA's   :1            NA's   :1          NA's   :1   


replaceOutlier <- function(x, cutoff = 3) {
  x[abs(x) > cutoff*sd(x, na.rm = T)] <- NA_real_
  x
}

result <- data %>%
  mutate_all(replaceOutlier)

# summary of result data
summary(result)

       X                   Y                  Z          
 Min.   :-5.215726   Min.   :0.000264   Min.   :-2.2037  
 1st Qu.:-0.688045   1st Qu.:0.234386   1st Qu.: 0.3932  
 Median : 0.009348   Median :0.476328   Median : 0.9879  
 Mean   : 0.014648   Mean   :0.486287   Mean   : 0.9571  
 3rd Qu.: 0.697789   3rd Qu.:0.737633   3rd Qu.: 1.5897  
 Max.   : 4.047586   Max.   :0.998272   Max.   : 3.0065  
 NA's   :17          NA's   :18         NA's   :37       

这是一个更简洁的版本,这要感谢@andrew_reece-

data %>% 
  mutate_all(list(~if_else(abs(.) > cutoff*sd(., na.rm = T), NA_real_, .)))