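I have a data frame with 22 numeric columns. When I run summary(df) on it, I get the details for each column (minimum, maximum, mean, median, 1st and 3rd quartiles). Now I want to get the 1st and 3rd quartiles for every column. Any value below the 1st quartile or above the 3rd quartile counts as an outlier, and I want to replace those outliers with NA.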
Summary:
     Var 1            Var2             Var 3             Var 4
 Min.   :    0    Min.   :   0    Min.   :     0    Min.   :-127.00
 1st Qu.: 1208    1st Qu.:1150    1st Qu.:135000    1st Qu.:  98
 Median : 1400    Median :1300    Median :180000    Median :  99
 Mean   : 1617    Mean   :2138    Mean   :211759    Mean   :  96.59
 3rd Qu.: 1990    3rd Qu.:2500    3rd Qu.:250000    3rd Qu.: 100
 Max.   :10000    Max.   :4000    Max.   : 40000    Max.   :9999
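This is not a duplicate question, because we are not explicitly fixing an interquartile range; the cut-off values are derived from the data itself.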
Answer 0 (score: 0)
Here is a long, commented way to do it; there are thousands of others.
### Take Q1 and Q3 from summary() (elements 2 and 5); you could also use quantile(), which lets you choose the estimation method
q1 <- as.numeric(summary(old_vector)[2])
q3 <- as.numeric(summary(old_vector)[5])
### Build the new vector value by value: NA for anything outside [q1, q3]
new_vector <- vector()
for (value in old_vector) {
  if (!is.na(value) && (value < q1 || value > q3)) {
    new_vector <- append(new_vector, NA)
  } else {
    new_vector <- append(new_vector, value)
  }
}
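The same replacement can also be written without a loop. The following is only a minimal sketch, not the code from this answer: it assumes a numeric vector and uses quantile() with its default method (type 7, which is also what summary() uses). The helper name replace_outside_quartiles and the sample values are made up for illustration.

```r
# Hypothetical helper: set every value outside [Q1, Q3] to NA, vectorized.
replace_outside_quartiles <- function(x) {
  # names = FALSE drops the "25%" / "75%" labels; na.rm = TRUE ignores NAs
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  # Comparisons against NA give NA; which() keeps only the TRUE positions,
  # so values that were already missing are left untouched
  x[which(x < q[1] | x > q[2])] <- NA
  x
}

# Illustrative data only
old_vector <- c(1, 2, 3, 4, 100, NA)
new_vector <- replace_outside_quartiles(old_vector)
print(new_vector)
```

Indexing with a logical condition avoids growing the vector with append() on every iteration, which matters once the columns get long.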
Edited following your comment:
Of course you can, using a structure like this:
### your DF
df1 <- structure(
  list(
    Var1 = c(100.2, 110, 200, 456, 120000),
    var2 = c(NA, 4545, 45465, 44422, 250000),
    var3 = c(NA, 210000, 91500, 215000, 250000),
    var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)
  ),
  class = "data.frame",
  row.names = c(NA, -5L)
)
### declare the function that replaces a vector's outliers based on the Q1/Q3 boundaries
replace_outliers <- function(old_vector) {
  q1 <- as.numeric(summary(old_vector)[2])
  q3 <- as.numeric(summary(old_vector)[5])
  new_vector <- vector()
  for (value in old_vector) {
    if (!is.na(value) && (value < q1 || value > q3)) {
      new_vector <- append(new_vector, NA)
    } else {
      new_vector <- append(new_vector, value)
    }
  }
  return(new_vector)
}
### loop over the DF columns
for (col in colnames(df1)) {
  ### create the new column name
  name_new_col <- paste(col, "_replaced", sep = "")
  ### put the replaced values in the new column
  df1[, name_new_col] <- replace_outliers(df1[, col])
}
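You will get the DF with new columns named "Var[n]_replaced", which contain NA in place of every value that falls outside the [Q1, Q3] range of its column.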
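If the separate "_replaced" columns are not needed, the columns can instead be overwritten in one step. This is a minimal sketch under the assumption that every column is numeric, as in the question; replace_outside_quartiles is a hypothetical vectorized helper (not part of the answer above), and the data frame repeats the sample data from this answer.

```r
# Hypothetical helper: set every value outside [Q1, Q3] to NA, vectorized.
replace_outside_quartiles <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  # which() skips positions where the comparison is NA
  x[which(x < q[1] | x > q[2])] <- NA
  x
}

# Sample data from the answer
df1 <- data.frame(
  Var1 = c(100.2, 110, 200, 456, 120000),
  var2 = c(NA, 4545, 45465, 44422, 250000),
  var3 = c(NA, 210000, 91500, 215000, 250000),
  var4 = c(0.983, 0.44, 0.983, 0.78, 2.23)
)

# Assigning into df_clean[] keeps the data.frame shape and column names
df_clean <- df1
df_clean[] <- lapply(df1, replace_outside_quartiles)

# How many values were newly set to NA in each column
newly_na <- colSums(is.na(df_clean)) - colSums(is.na(df1))
print(df_clean)
print(newly_na)
```

Note that cutting at Q1 and Q3 discards roughly half of the non-missing values in each column by construction, so it is worth checking newly_na before relying on the cleaned data.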