Suppose I have the following example dataset:
set.seed(20130828)
data <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)),
                   Y = c(runif(1000), rnorm(20, 2), NA),
                   Z = c(rnorm(1000, 1), NA, runif(20)))
Using the following function, I identify the outliers, which I define as observations more than 3 standard deviations out:
findOutlier <- function(data, cutoff = 3) {
  sds <- apply(data, 2, sd, na.rm = TRUE)
  result <- mapply(function(d, s) {
    which(d > cutoff * s)
  }, data, sds)
  result
}
outliers <- findOutlier(data)
Now I need to replace all of the outliers with NA. I used the following function:
OutliersToNA <- function(data, outliers) {
  result <- mapply(function(d, o) {
    res <- d
    res[o] <- NA
    return(res)
  }, data, outliers)
  return(as.data.frame(result))
}
It returns the following error:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 4, 0, 1, 2, 3
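Can you suggest any improvement to the function so that it replaces the outliers with NA?
Answer 0 (score: 1)
I think there is a simpler way to do this. You can combine finding and replacing the outliers in a single function and then apply it to every column of the data frame. Let me know if this works for you: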
library(dplyr) # for mutate_all()
# summary of input data
summary(data)
X Y Z
Min. :-19.774100 Min. :0.000264 Min. :-2.2037
1st Qu.: -0.716794 1st Qu.:0.235934 1st Qu.: 0.4144
Median : 0.007454 Median :0.484328 Median : 1.0390
Mean : -0.027200 Mean :0.516226 Mean : 1.0428
3rd Qu.: 0.702163 3rd Qu.:0.749178 3rd Qu.: 1.6435
Max. : 15.758520 Max. :4.346755 Max. : 4.4933
NA's :1 NA's :1 NA's :1
replaceOutlier <- function(x, cutoff = 3) {
  # set every value whose absolute value exceeds cutoff * sd to NA
  x[abs(x) > cutoff * sd(x, na.rm = TRUE)] <- NA_real_
  x
}
result <- data %>%
  mutate_all(replaceOutlier)
# summary of result data
summary(result)
X Y Z
Min. :-5.215726 Min. :0.000264 Min. :-2.2037
1st Qu.:-0.688045 1st Qu.:0.234386 1st Qu.: 0.3932
Median : 0.009348 Median :0.476328 Median : 0.9879
Mean : 0.014648 Mean :0.486287 Mean : 0.9571
3rd Qu.: 0.697789 3rd Qu.:0.737633 3rd Qu.: 1.5897
Max. : 4.047586 Max. :0.998272 Max. : 3.0065
NA's :17 NA's :18 NA's :37
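Note that replaceOutlier() compares abs(x) with cutoff * sd, so distance is measured from zero rather than from the column mean. If "3 sd" is meant as distance from the mean, a mean-centered variant might look like the sketch below. It uses base R only and assumes every column is numeric; the names replace_outlier_centered and result_centered are made up for this illustration.

```r
# Hypothetical mean-centered variant: flag values that lie more than
# cutoff standard deviations away from the column mean.
replace_outlier_centered <- function(x, cutoff = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  # which() drops NA comparisons, so existing NAs are simply left as they are
  x[which(abs(x - m) > cutoff * s)] <- NA
  x
}

# Same example data as in the question
set.seed(20130828)
data <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)),
                   Y = c(runif(1000), rnorm(20, 2), NA),
                   Z = c(rnorm(1000, 1), NA, runif(20)))

# Assigning into result_centered[] keeps the data frame structure
result_centered <- data
result_centered[] <- lapply(data, replace_outlier_centered)
summary(result_centered)
```

For columns whose mean is far from zero (Z here has a mean near 1), the two rules can flag different observations, so it is worth deciding which definition you actually want.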
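Here is a more concise version, thanks to @andrew_reece:
cutoff <- 3  # must be defined, since the formula below refers to it
data %>%
  mutate_all(list(~if_else(abs(.) > cutoff * sd(., na.rm = TRUE), NA_real_, .)))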
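In dplyr 1.0.0 and later, mutate_all() is superseded by across(). Assuming a recent dplyr and all-numeric columns, the same replacement could be written roughly as follows; result_across is just an illustrative name.

```r
library(dplyr)

# Same example data as in the question
set.seed(20130828)
data <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)),
                   Y = c(runif(1000), rnorm(20, 2), NA),
                   Z = c(rnorm(1000, 1), NA, runif(20)))

cutoff <- 3

# Apply the same rule to every column; .x is the current column
result_across <- data %>%
  mutate(across(everything(),
                ~ if_else(abs(.x) > cutoff * sd(.x, na.rm = TRUE), NA_real_, .x)))

summary(result_across)
```

Because if_else() returns NA where the condition is NA, values that were already missing stay missing.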