How to get the outliers for all columns of a data frame in R

Date: 2018-02-24 13:53:38

Tags: r

I am working on a generic function that takes a data frame, returns all the outliers for each variable in it, and then removes them.

 outliers <- function(dataframe){
   dataframe <- select_if(dataframe, is.numeric)
   for(i in 1:length(dataframe)){
   paste(names(dataframe)[i]) <- boxplot.stats(names(dataframe)[i])$out)

  }
}

I want to output all the outliers in each variable and then eventually remove all of them from the data frame.

I can remove them one column at a time like this:

Clean_Data[!Clean_Data$House_Price %in% boxplot.stats(Clean_Data$House_Price)$out,]

You can get the data with:

Clean_Data = read.csv('http://ucanalytics.com/blogs/wp-content/uploads/2016/09/Regression-Clean-Data.csv')

2 answers:

Answer 0 (score: 5)

We create a function that selects only the numeric columns (select_if), loops through those columns (map), and subsets the elements that are not outliers. The output is a list of vectors.

library(dplyr)
library(tidyr)
library(purrr)
outlierremoval <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>% #selects on the numeric columns
    map(~ .x[!.x %in% boxplot.stats(.)$out]) #%>%
    # not clear whether we need to output as a list or data.frame
    # if it is the latter, the columns could be of different length
    # so we may use cbind.fill
    # { do.call(rowr::cbind.fill, c(., list(fill = NA)))}
}

outlierremoval(Clean_Data)

If we want to keep all the other columns, use map_if, and use cbind.fill to pad with NA at the end so that the output is a data.frame. However, this also shifts the row positions within each column, depending on how many outliers were removed from it.

outlierremoval <- function(dataframe){
  dataframe %>%
    map_if(is.numeric, ~ .x[!.x %in% boxplot.stats(.)$out]) %>%
    { do.call(rowr::cbind.fill, c(., list(fill = NA)))} %>%
    set_names(names(dataframe))
}

res <- outlierremoval(Clean_Data)
head(res)
#  X Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup      Parking City_Category Rainfall House_Price
#1 1           1      9796        5250         10703   1659    1961         Open         CAT B      530     6649000
#2 2           2      8294        8186         12694   1461    1752 Not Provided         CAT B      210     3982000
#3 3           3     11001       14399         16991   1340    1609 Not Provided         CAT A      720     5401000
#4 4           4      8301       11188         12289   1451    1748      Covered         CAT B      620     5373000
#5 5           5     10510       12629         13921   1770    2111 Not Provided         CAT B      450     4662000
#6 6           6      6665        5142          9972   1442    1733         Open         CAT B      760     4526000

Update

If we need to get the outliers themselves, extract them from boxplot.stats in the map step.

outliers <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>%
    map(~ boxplot.stats(.x)$out)
}

outliers(Clean_Data)

Or replace the outliers with NA (this also preserves the row positions).
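No code accompanies the NA-replacement variant here. The following is only a minimal sketch of one way it could be done, written in base R rather than with purrr; the function name outlier_to_na and the toy data frame are made up for illustration and are not the answer's original code.

```r
# Toy data, invented for illustration (not the Clean_Data set from the question)
toy <- data.frame(
  id    = 1:8,
  price = c(10, 12, 11, 13, 12, 11, 95, 12),
  group = c("a", "b", "a", "b", "a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: in every numeric column, set the values that
# boxplot.stats() flags as outliers to NA. Non-numeric columns are left
# untouched, and the number and order of rows do not change.
outlier_to_na <- function(dataframe) {
  is_num <- vapply(dataframe, is.numeric, logical(1))
  dataframe[is_num] <- lapply(dataframe[is_num], function(x) {
    replace(x, x %in% boxplot.stats(x)$out, NA)
  })
  dataframe
}

res <- outlier_to_na(toy)
res
```

Because the flagged values become NA instead of being dropped, every row keeps its position, at the cost of having to handle missing values later (for example with na.rm = TRUE).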

Answer 1 (score: 0)

Here is what I did with the Heart Disease UCI data:

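library(dplyr)  # select_if() and %>%
library(purrr)  # map()

df <- read.csv("heart.csv")
boxplot(df)  # boxplots before removing outliers

# Return a list holding the outliers of each numeric column
findOutliers <- function(dataframe){
  dataframe %>%
    select_if(is.numeric) %>%
    map(~ boxplot.stats(.x)$out)
}
outliers <- findOutliers(df)

# Drop each column's outliers independently; the columns may end up with different lengths
temp <- list()
for (col in names(outliers)) {
  outlier <- outliers[[col]]
  if (length(outlier) > 0) {
    temp[col] <- df[-which(df[[col]] %in% outlier),][col]
  } else {
    temp[col] <- df[col]
  }
}
boxplot(temp)  # boxplots after removing outliers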

Before removing outliers:

[image: boxplots before removing outliers]

After removing outliers:

[image: boxplots after removing outliers]
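Note that temp in the code above is a list whose elements can have different lengths, since each column loses its own outliers independently. If a rectangular data frame is wanted instead, one option is to drop every row that has a flagged value in any numeric column. The following is a minimal base-R sketch of that idea; drop_outlier_rows and the toy data are hypothetical names and values, not part of the answer.

```r
# Toy data, invented for illustration
toy <- data.frame(
  a     = c(1, 2, 3, 2, 3, 2, 50, 2),
  b     = c(5, 6, 5, 7, 6, -40, 6, 5),
  label = letters[1:8],
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove every row in which at least one numeric
# column holds a value that boxplot.stats() flags as an outlier.
drop_outlier_rows <- function(dataframe) {
  is_num <- vapply(dataframe, is.numeric, logical(1))
  # One logical vector per numeric column: TRUE where the value is flagged
  flags <- lapply(dataframe[is_num], function(x) x %in% boxplot.stats(x)$out)
  # A row is dropped if any of its numeric values is flagged
  drop <- Reduce(`|`, flags, rep(FALSE, nrow(dataframe)))
  dataframe[!drop, , drop = FALSE]
}

cleaned <- drop_outlier_rows(toy)
cleaned
```

The outliers are computed once, on the original data. After rows are removed, the box plot statistics change, so recomputing them on the cleaned data may flag new points.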