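I want to create a function that removes all outliers from a data set. I have read many Stack Overflow posts on this topic, so I am aware of the dangers of removing outliers. Here is what I have so far: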
# Remove outliers from a column
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Removes all outliers from a data set
remove_all_outliers <- function(df){
  # We only want the numeric columns
  a <- df[, sapply(df, is.numeric)]
  b <- df[, sapply(df, !is.numeric)]
  a <- lapply(a, function(x) remove_outliers(x))
  d <- merge(a, b)
  d
}
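I know this has some problems, so please correct me if anything could be handled better.

1. `!is.numeric()` is not a thing (and neither is `is.numeric==FALSE`). How should I do this?
2. `is.numeric()` converts factors to integers. How do I prevent this?
3. Am I using `lapply` correctly?

Answer 0 (score: 4)
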
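Factors are stored as integers, but they are not plain integer vectors: `is.numeric()` returns `FALSE` for a factor, so it does not convert factors to integers.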
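
A quick console check, offered as a sketch rather than as part of the original answer, shows the difference between how a factor is stored and how `is.numeric()` treats it. The vector `f` is made up for illustration.

```r
# A small factor; its levels are sorted alphabetically: "high", "low".
f <- factor(c("low", "high", "low"))

typeof(f)       # "integer": the level codes are stored as integers
class(f)        # "factor"
is.numeric(f)   # FALSE: factors are not treated as numeric

# The codes only appear after an explicit conversion.
as.integer(f)   # 2 1 2
```

Because `is.numeric()` is `FALSE` for factors, a filter based on it leaves factor columns untouched.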
# Remove outliers from a column
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
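
To make the rule inside `remove_outliers` concrete, here is a worked example on a made-up vector. It repeats the same steps inline (quartiles, 1.5 times the IQR, then the two cutoffs); the names `lower` and `upper` are introduced only for this illustration, and the default quantile type is assumed.

```r
x <- c(1, 2, 3, 4, 100)

# Quartiles with R's default quantile type: 25% is 2, 75% is 4.
qnt <- quantile(x, probs = c(.25, .75))

# IQR is 4 - 2 = 2, so the allowed margin is 1.5 * 2 = 3.
H <- 1.5 * IQR(x)

lower <- qnt[1] - H   # 2 - 3 = -1
upper <- qnt[2] + H   # 4 + 3 = 7

# Values outside [lower, upper] become NA; here only 100 is affected.
x[x < lower | x > upper] <- NA
x                     # 1 2 3 4 NA
```

Note that the values are replaced with `NA` rather than dropped, so the vector keeps its length and still lines up with the other columns of a data frame.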
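You can replace the columns by index, so you do not need to create separate data sets. Just make sure you pass the same columns to `lapply` that you assign to; for example, you do not want to do `data[, 1:3] <- lapply(data, FUN)`, which I have done many times.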
# Removes all outliers from a data set
remove_all_outliers1 <- function(df){
  # We only want the numeric columns
  df[, sapply(df, is.numeric)] <- lapply(df[, sapply(df, is.numeric)], remove_outliers)
  df
}
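
One way to guard against the mismatch described above is to compute the column selector once and use it on both sides of the assignment. This is a minimal sketch, not the answer's own code: the data frame, the name `num`, and the doubling function are placeholders. It also assumes list-style indexing (`df[num]`), which keeps a data frame even when only one column is selected.

```r
df <- data.frame(id = c("a", "b", "c"),
                 x  = c(1, 2, 3),
                 y  = c(10, 20, 30),
                 stringsAsFactors = FALSE)

# Compute the selector once so both sides refer to the same columns.
num <- vapply(df, is.numeric, logical(1))

# Placeholder transformation: double every numeric column.
df[num] <- lapply(df[num], function(col) col * 2)

df$x   # 2 4 6
df$id  # "a" "b" "c" (unchanged)
```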
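Similar to the above (and, I think, slightly easier), you can pass the whole data set to `lapply`. Also make sure not to do
data <- lapply(data, if (x) something else anotherthing)
or
data[] <- lapply(data, if (x) something)
which are also mistakes I have made many times. The first returns a plain list rather than a data frame because the `[]` is missing; the second has no `else`, so columns that fail the condition come back as `NULL`.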
remove_all_outliers2 <- function(df){
  df[] <- lapply(df, function(x) if (is.numeric(x))
    remove_outliers(x) else x)
  df
}
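
The two mistakes mentioned above can be reproduced on a toy data frame. This is an illustrative sketch: the data and the doubling step are made up, and the shorthand from the text is expanded into full anonymous functions.

```r
df <- data.frame(n = c(1, 2, 3),
                 s = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# Mistake 1: without `[]` the result is a plain list, not a data frame.
as_list <- lapply(df, function(x) if (is.numeric(x)) x * 2 else x)
class(as_list)             # "list"

# Mistake 2: without `else`, `if` yields NULL for non-numeric columns.
no_else <- lapply(df, function(x) if (is.numeric(x)) x * 2)
is.null(no_else[["s"]])    # TRUE

# Keeping both `[]` and `else` preserves the data frame and every column.
df[] <- lapply(df, function(x) if (is.numeric(x)) x * 2 else x)
class(df)                  # "data.frame"
```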
## test
mt <- within(mtcars, {
  mpg <- factor(mpg)
  gear <- letters[1:2]
})
head(mt)
identical(remove_all_outliers1(mt), remove_all_outliers2(mt))
# [1] TRUE
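Your idea can work with a few small adjustments. `!is.numeric` can be written as `Negate(is.numeric)`, or more verbosely as `function(x) !is.numeric(x)`, or as `!sapply(x, is.numeric)`. In general, applying one function directly to another, as in `function(function)`, does not work out of the box.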
# Removes all outliers from a data set
remove_all_outliers <- function(df){
  # We only want the numeric columns
  ## drop = FALSE in case only one column for either
  a <- df[, sapply(df, is.numeric), drop = FALSE]
  b <- df[, sapply(df, Negate(is.numeric)), drop = FALSE]
  ## note brackets
  a[] <- lapply(a, function(x) remove_outliers(x))
  ## stack them back together, not merge
  ## you could merge if you had a unique id, one id per row
  ## then make sure the columns are returned in the original order
  d <- cbind(a, b)
  ## drop = FALSE here too, so a one-column input stays a data frame
  d[, names(df), drop = FALSE]
}
identical(remove_all_outliers2(mt), remove_all_outliers(mt))
# [1] TRUE
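
The three spellings of "not numeric" mentioned earlier can be checked side by side. The sketch below uses a made-up data frame and wraps the failing call in `tryCatch` only so that the script keeps running; the names `res`, `sel1`, `sel2`, and `sel3` are illustrative.

```r
df <- data.frame(n = c(1, 2),
                 f = factor(c("a", "b")),
                 s = c("x", "y"),
                 stringsAsFactors = FALSE)

# `!` cannot be applied to a function object, so this call fails.
res <- tryCatch(sapply(df, !is.numeric), error = function(e) "error")
res                        # "error"

# Three equivalent ways to flag the non-numeric columns.
sel1 <- sapply(df, Negate(is.numeric))
sel2 <- sapply(df, function(x) !is.numeric(x))
sel3 <- !sapply(df, is.numeric)

identical(sel1, sel2) && identical(sel2, sel3)   # TRUE
names(df)[sel1]            # "f" "s"
```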