Question

我具有以下功能，该功能可以从数据框中移除高于或低于给定列attr的平均值的n个标准偏差：

remove_outliers <- function(df,attr,n){
  outliersgone <- df[df[,attr]<=(mean(df[,attr],na.rm=TRUE)+n*sd(df[,attr],na.rm=TRUE)) & df[,attr]>=(mean(df[,attr],na.rm=TRUE)-n*sd(df[,attr],na.rm=TRUE)),]
  return(outliersgone)
}

现在，实际上，我对像iris这样的数据框感兴趣，最好首先对Species进行分组，然后除去异常值。也就是说，对于所选setosa，删除所有attr物种的实例，这些实例的均值高于或低于该物种的平均值。

我尝试使用dplyr进行如下编码：

remove_outliers_grouped <- function(df,attr,n,Factor=NULL){
  outliers_grouped_gone <- as.data.frame(df %>% group_by_at(Factor)) %>% remove_outliers(.,attr,n)
  return(outliers_grouped_gone)
}

现在我们可以尝试

irisG <- remove_outliers_grouped(df=iris, attr="Petal.Length", n=1.2, Factor="Species")
irisG2 <- remove_outliers_grouped(df=iris, attr="Petal.Length", n=1.2)

但是两个数据帧是相同的。即，尽管未显示任何错误消息，但即使我提供了Factor="Species"作为参数，因子分组也没有发生。我该如何解决？

（坚持R和dplyr基的解决方案是理想选择...）

Answer 1

我将通过以下方式重新编写函数，整理一些表达式（您可以使用... <= ... & ... >= ...更简洁地编写abs表达式），并允许带引号和不带引号的列名（使用非标准评估）：

remove_outlier <- function(df, var, n) {
    var <- rlang::parse_expr(quo_name(enquo(var)))
    df %>% filter(abs(!!var - mean(!!var, na.rm = TRUE)) <= n * sd(!!var, na.rm = TRUE))
}

非分组用例看起来像这样

iris %>% remove_outlier(Petal.Length, 0.1)
# Or quoted: iris %>% remove_outlier("Petal.Length", 0.1)
#  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#1          5.2         2.7          3.9         1.4 versicolor
#2          5.6         2.9          3.6         1.3 versicolor
#3          5.6         2.5          3.9         1.1 versicolor
#4          5.5         2.4          3.8         1.1 versicolor
#5          5.5         2.4          3.7         1.0 versicolor
#6          5.8         2.7          3.9         1.2 versicolor

在功能之外进行分组

iris %>% group_by(Species) %>% remove_outlier(Petal.Length, 0.1) %>% ungroup()
## A tibble: 11 x 5
#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#1          6.4         2.9          4.3         1.3 versicolor
#2          6.2         2.9          4.3         1.3 versicolor
#3          6.3         2.9          5.6         1.8 virginica 
#4          6.8         3            5.5         2.1 virginica 
#5          6.5         3            5.5         1.8 virginica 
#6          6.4         2.8          5.6         2.1 virginica 
#7          6.4         2.8          5.6         2.2 virginica 
#8          6.1         2.6          5.6         1.4 virginica 
#9          6.3         3.4          5.6         2.4 virginica 
#10          6.4         3.1          5.5         1.8 virginica 
#11          6.7         3.1          5.6         2.4 virginica

返回tibble包含原始数据，其中按组删除了异常值。

更新1

评论1

要删除rlang的依赖关系，我们可以定义一个仅使用带引号的列名的函数，然后使用sym将其转换为符号； sym由dplyr导出，因此我们不需要显式的library(rlang)。

library(dplyr)
remove_outlier_quoted <- function(df, var, n) {
    df %>% filter(abs(!!sym(var) - mean(!!sym(var), na.rm = TRUE)) <= n * sd(!!sym(var), na.rm = TRUE))
}
iris %>% remove_outlier_quoted("Petal.Length", 0.1)

评论2

就个人而言，我将分组保留在函数之外； IMO这将是更dplyr规范的方式（例如，没有summarise_grouped函数，相反，dplyr需要显式使用group_by和summarise），而且可能更容易阅读。但我想这归结为偏好问题。您可以这样做：

library(dplyr)
library(rlang)
remove_outlier_grouped <- function(df, var, group, n) {
    var <- rlang::parse_expr(quo_name(enquo(var)))
    group <- rlang::parse_expr(quo_name(enquo(group)))
    df %>%
        group_by(!!group) %>%
        filter(abs(!!var - mean(!!var, na.rm = TRUE)) <= n * sd(!!var, na.rm = TRUE)) %>%
        ungroup()
}
iris %>% remove_outlier_grouped(Petal.Length, Species, 0.1)
# Or: iris %>% remove_outlier_grouped("Petal.Length", "Species", 0.1)

var的{{1}}和group参数都可以加引号或不加引号。要删除remove_outlier_grouped依赖项，请参阅“重新注释1”。

更新2

要解决您的最后评论，这里是一个选项，其中定义了两个功能。在一个假设的R包中，“重载器” rlang不会导出，而包装函数remove_outlier会导出。所有参数检查均在remove_outlier_grouped中进行。在此示例中，我使用NSE，即函数参数必须被取消引用，并在remove_outlier_grouped中使用enquo进行混淆。（如果您想同时使用带引号和不带引号的参数，则在引用和评估时要格外小心，这会使该示例变得不必要地复杂。您可以参阅我的原始帖子了解该操作的一般情况。）

remove_outlier_grouped

Answer 2

预先：

删除as.data.frame，添加do。
从df[,attr]更改为df[[attr]]。

此答案尝试尽可能保留您的功能。话虽如此，从长远来看，我认为@MauritsEvers建议改用rlang风格的编程方式访问dplyr功能的建议可能更好。它有一点学习曲线，但是从长远来看会有所收获（例如，使用dplyr可以更直观，更正确地运行功能）。

首先：停止使用`as.data.frame`

...并添加do，以便在remove_outliers函数外部正确处理分组。

如果我debug(remove_outliers_grouped)，然后运行第一个代码，我会看到：

debug(remove_outliers_grouped)
remove_outliers_grouped(df=iris, attr="Petal.Length", n=1.2, Factor="Species")
# debugging in: remove_outliers_grouped(df = iris, attr = "Petal.Length", n = 1.2, 
#     Factor = "Species")
# debug at #1: {
#     outliers_grouped_gone <- as.data.frame(df %>% group_by_at(Factor)) %>% 
#         remove_outliers(., attr, n)
#     return(outliers_grouped_gone)
# }
# Browse[2]> 

df %>%
  group_by_at(Factor) %>%
  attributes(.)
# $names
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
# $row.names
#   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29
#  [30]  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58
#  [59]  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87
#  [88]  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
# [117] 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
# [146] 146 147 148 149 150
# $groups
# # A tibble: 3 x 2
#   Species          .rows
# * <fct>      <list<int>>
# 1 setosa            [50]
# 2 versicolor        [50]
# 3 virginica         [50]
# $class
# [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
# Browse[2]> 

as.data.frame(df %>%
  group_by_at(Factor)) %>%
  attributes(.)
# $class
# [1] "data.frame"
# $row.names
#   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29
#  [30]  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58
#  [59]  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87
#  [88]  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
# [117] 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
# [146] 146 147 148 149 150
# $names
# [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

请注意，第一个示例中$groups组件是如何存在的，而第二个示例中缺失了do组件？那是as.data.frame做事情所需要的。

这将产生以下功能（删除do并添加remove_outliers_grouped <- function(df,attr,n,Factor=NULL){ outliers_grouped_gone <- df %>% group_by_at(Factor) %>% do(remove_outliers(.,attr,n)) return(outliers_grouped_gone) }）：

remove_outliers

这可以工作，但是您的data.frame函数仅假设tbl_df，而不假定as.data.frame。我知道这可能就是您考虑使用df[,attr]的原因，所以

第二：停止使用`df[[attr]]`

使用df[,attr]代替as.data.frame或将df[,attr]移至第一个函数内的。我将说明为什么tbl_df失败（使用debug(remove_outliers) remove_outliers_grouped(df=iris, attr="Petal.Length", n=1.2, Factor="Species") # debugging in: remove_outliers(., attr, n) # debug at #1: { # outliersgone <- df[df[, attr] <= (mean(df[, attr], na.rm = TRUE) + # n * sd(df[, attr], na.rm = TRUE)) & df[, attr] >= (mean(df[, # attr], na.rm = TRUE) - n * sd(df[, attr], na.rm = TRUE)), # ] # return(outliersgone) # } # Browse[2]> df[,attr] # # A tibble: 50 x 1 # Petal.Length # <dbl> # 1 1.4 # 2 1.4 # 3 1.3 # 4 1.5 # 5 1.4 # 6 1.7 # 7 1.4 # 8 1.5 # 9 1.4 # 10 1.5 # # ... with 40 more rows # Browse[2]> as.data.frame(df)[,attr] # [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 # [30] 1.6 1.6 1.5 1.5 1.4 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 # Browse[2]> df[[attr]] # [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 # [30] 1.6 1.6 1.5 1.5 1.4 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4）并且两个选项都起作用。

data.frame

这实际上是函数尝试与tbl_df，data.table和df[[attr]]互换使用的函数的常见问题...

在使用df <- as.data.frame(df)和remove_outliers <- function(df,attr,n){ outliersgone <- df[df[[attr]]<=(mean(df[[attr]],na.rm=TRUE)+n*sd(df[[attr]],na.rm=TRUE)) & df[[attr]]>=(mean(df[[attr]],na.rm=TRUE)-n*sd(df[[attr]],na.rm=TRUE)),] return(outliersgone) } remove_outliers_grouped(df=iris, attr="Petal.Length", n=1.2, Factor="Species") # # A tibble: 121 x 5 # # Groups: Species [3] # Sepal.Length Sepal.Width Petal.Length Petal.Width Species # <dbl> <dbl> <dbl> <dbl> <fct> # 1 5.1 3.5 1.4 0.2 setosa # 2 4.9 3 1.4 0.2 setosa # 3 4.7 3.2 1.3 0.2 setosa # 4 4.6 3.1 1.5 0.2 setosa # 5 5 3.6 1.4 0.2 setosa # 6 4.6 3.4 1.4 0.3 setosa # 7 5 3.4 1.5 0.2 setosa # 8 4.4 2.9 1.4 0.2 setosa # 9 4.9 3.1 1.5 0.1 setosa # 10 5.4 3.7 1.5 0.2 setosa # # ... with 111 more rows（然后是其余未更改的代码）之间，我敦促第一个（第二个在来回转换方面有更多的开销，并且知道如何访问安全的数据，而不必复制它并更改类和各种各样的事情……是能够做的一件好事。）

所以我们将第一个函数更改为：

df[,attr]

（如果您很好奇，df[,attr,drop=TRUE]最终是tbl_df，这是一种默认行为，既使我感到沮丧，又与其他类似框架的对象{{1} }和data.table。您可以使用df[,attr,drop=FALSE]模仿单列保留的行为，这是我在很多个人代码和程序包中都采用的方法。）

一个也许更简单的单功能实现：

remove_outliers2 <- function(df, attr, n) { mu <- mean(df[[attr]], na.rm=TRUE) sigma <- sd(df[[attr]], na.rm=TRUE) df[ between(df[[attr]], mu - n*sigma, mu + n*sigma), ] } iris %>% group_by(Species) %>% do(remove_outliers2(., attr = "Petal.Length", n = 1.2)) %>% ungroup()

或者如果您真的想要单一功能，这是一个黑客

remove_outliers3 <- function(df, attr, n, Factor) { if (!missing(Factor)) { group_by_at(df, Factor) %>% do(remove_outliers3(., attr = attr, n = n)) } else { mu <- mean(df[[attr]], na.rm=TRUE) sigma <- sd(df[[attr]], na.rm=TRUE) df[ between(df[[attr]], mu - n*sigma, mu + n*sigma), ] } } remove_outliers3(iris, "Petal.Length", n = 1.2, Factor = "Species")

尽管它的递归性质很好，但我认为我更喜欢以前的版本，因为它很简单。

删除按R因子分组的离群值

2 个答案:

更新1

评论1

评论2

更新2

首先：停止使用`as.data.frame`

第二：停止使用`df[[attr]]`

删除按R因子分组的离群值

2 个答案:

更新1

评论1

评论2

更新2

首先：停止使用as.data.frame

第二：停止使用df[[attr]]

首先：停止使用`as.data.frame`

第二：停止使用`df[[attr]]`