Compute outliers for a specific set of columns, then identify IDs that have outliers in > 5 columns

Time: 2020-03-31 00:39:34

Tags: r loops for-loop subset outliers

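I am working with a large data frame (df). I want to compute outliers for a specific subset of columns, based on mean + 3 sd.

I first extracted the columns I want, i.e. all the columns with color in their names.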

colors = colnames(df)[grep('color', colnames(df))]

I'm not sure how I should then loop over this new variable to compute the outliers for all of these columns. My formula is:

# id those with upper outliers
uthr = mean(df$color)+3*sd(df$color)
rm_u_ids = df$id[which(df$color >= uthr)]

# id those with lower outliers
lthr = mean(df$color)-3*sd(df$color)
rm_l_ids = df$id[which(df$color <= lthr)]

# remove ids with either an upper or a lower outlier (filter and %>% need library(dplyr))
rm_ids = sort(c(rm_u_ids, rm_l_ids))
df_2 = df %>% filter(!id %in% rm_ids)

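Now for the actual question. I would like to use something similar to do the following:
1) for each color in colors, identify the IDs that have outliers, perhaps saving this information somewhere else,
2) using that information (possibly in a list or a separate data frame), identify the IDs that appear in 5 or more of the columns in colors,
3) use this list to subset the original data frame, so that we eliminate the IDs with outliers in 5 or more color columns.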

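Does this make sense? I'm also not sure whether a loop is advisable for this problem.

Thanks, and sorry if I made this sound more complicated than it should be!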

3 answers:

Answer 0: (score: 2)

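An alternative to the clever answers already provided is to convert the relevant columns to a matrix and use some fast matrix operations: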

df = iris
colors = colnames(iris)[1:4]
m = as.matrix(df[,colors])

# Standardize the numeric values in each column
m = scale(m)

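# Apply some outlier definition rules, e.g.
# detect measurements with |Zscore|>3
outliers = abs(m)>3
# count such measurements per row, then detect rows with at least 5 of them
# (iris has only 4 numeric columns, so nothing is returned here; lower the cutoff to try it)
outliers = rowSums(outliers)
which(outliers>=5)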
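To connect the matrix approach back to the question's `id` column and the final subsetting step, here is a minimal sketch. The toy data frame, its column names and the name `n_out` are made up for illustration; only `df`, `colors`, `id` and `df_2` come from the question.

```r
# Hypothetical toy data: 30 ids and six "color" columns with mild variation
df <- data.frame(id = 1:30)
color_names <- paste0("color_", c("red", "blue", "green", "pink", "grey", "gold"))
for (nm in color_names) df[[nm]] <- rep(c(-1, 0, 1), times = 10)

# id 7 is extreme in five color columns, id 12 in only one
df[df$id == 7, color_names[1:5]] <- 50
df[df$id == 12, color_names[6]] <- 50

# Same column selection as in the question
colors <- grep("color", colnames(df), value = TRUE)

# Z-scores per column, then the number of |z| > 3 cells per row
z <- scale(as.matrix(df[, colors]))
n_out <- rowSums(abs(z) > 3, na.rm = TRUE)  # NA cells are not counted as outliers

# ids with outliers in 5 or more color columns, and the data without them
remove_ids <- df$id[n_out >= 5]
df_2 <- df[!df$id %in% remove_ids, ]
```

In this toy data only id 7 is removed, because id 12 is extreme in a single column. Note that `abs(z) > 3` is a strict cutoff, whereas the question's formula uses `>=` and `<=`.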

Answer 1: (score: 1)

You can create a function that returns the ids that have outliers

find_outlier <- function(df, x) {
  uthr = mean(x)+3*sd(x)
  rm_u_ids = df$id[which(x >= uthr)]
  # id those with lower outliers
  lthr = mean(x)-3*sd(x)
  rm_l_ids = df$id[which(x <= lthr)]
  # return the ids that have either an upper or a lower outlier
  unique(sort(c(rm_u_ids, rm_l_ids)))
}

Apply it to each of the colors columns, count how often each id occurs with table, and remove the ids that occur 5 or more times

all_ids <- lapply(df[colors], find_outlier, df = df)

temp_tab <- table(unlist(all_ids))
remove_ids <- names(temp_tab[temp_tab >= 5])
subset(df, !id %in% remove_ids)
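To see what the intermediate objects look like, the steps above can be run on a small made-up data set. This is only a sketch: the toy data and its column names are assumptions, and `find_outlier` is repeated in condensed form so the block runs on its own.

```r
# Hypothetical toy data: 30 ids and six "color" columns with mild variation
df <- data.frame(id = 1:30)
color_names <- paste0("color_", c("red", "blue", "green", "pink", "grey", "gold"))
for (nm in color_names) df[[nm]] <- rep(c(-1, 0, 1), times = 10)
df[df$id == 7, color_names[1:5]] <- 50   # high outliers in five columns
df[df$id == 12, color_names[6]] <- -50   # one low outlier

colors <- grep("color", colnames(df), value = TRUE)

# Condensed copy of the function above
find_outlier <- function(df, x) {
  uthr <- mean(x) + 3 * sd(x)
  lthr <- mean(x) - 3 * sd(x)
  unique(sort(c(df$id[which(x >= uthr)], df$id[which(x <= lthr)])))
}

# Step 1: a named list with one vector of outlier ids per color column
all_ids <- lapply(df[colors], find_outlier, df = df)

# Step 2: in how many columns does each id show up?
temp_tab <- table(unlist(all_ids))
remove_ids <- names(temp_tab[temp_tab >= 5])

# Step 3: drop those ids from the original data
df_2 <- subset(df, !id %in% remove_ids)
```

`remove_ids` is a character vector because it is taken from the names of the table; `%in%` still matches it against the integer `id` column. If a column contains missing values, `mean` and `sd` would need `na.rm = TRUE`.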

Answer 2: (score: 1)

I am going to assume that your data.frame contains only the numeric variables you want


findOutlierCols = function(color.df){
  hasOutliers = function(col){
    bds = mean(col) + c(-3,3)*sd(col)
    if(any(col <= bds[1]) || any(col >= bds[2])){
      return(TRUE)
    }else{
      return(FALSE)
    }
  }  
  apply(color.df, 2, hasOutliers)
}

## make some fake data
set.seed(123)
x = matrix(rnorm(1000), ncol = 10)
color.df = data.frame(x)
colnames(color.df) = paste0("color.", colors()[1:10])
color.df = apply(color.df, 2, function(col){col+rbinom(100, 5, 0.1)})

boxplot(color.df)
findOutlierCols(color.df)