Question

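I want to identify outliers in my data set within groups defined by some grouping variables. To do that, I want to create/mutate an additional column, outlier, that holds FALSE / TRUE for the corresponding row. Only the numeric variables should be included.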

    library(AER)
    library(dplyr)  # provides %>%, group_by() and mutate()

    # Load Data
    data("CigarettesSW")

    head(CigarettesSW)

    # state year   cpi population packs    income  tax  price  taxs             
    # 1    AL 1985 1.076    3973000 116.5  46014968 32.5 102.18 33.35
    # 2    AR 1985 1.076    2327000 128.5  26210736 37.0 101.47 37.00
    # 3    AZ 1985 1.076    3184000 104.5  43956936 31.0 108.58 36.17
    # 4    CA 1985 1.076   26444000 100.4 447102816 26.0 107.84 32.10
    # 5    CO 1985 1.076    3209000 113.0  49466672 31.0  94.27 31.00
    # 6    CT 1985 1.076    3201000 109.3  60063368 42.0 128.02 51.48



    # Custom function
    is_outlier <- function(x) {
      return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
    }


    R> CigarettesSW %>% group_by(state) %>% mutate(outlier = lapply(., is_outlier))
    Error in mutate_impl(.data, dots) : factors are not allowed

Here I tried to pass only the numeric variables.

    R> CigarettesSW %>% group_by(state) %>% mutate_at(3:9, outlier = lapply(., is_outlier))
    Error in quantile.default(x, 0.25) : factors are not allowed

However, this returns an error as well. I'm not sure how to approach it differently.
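The error can be reproduced outside dplyr. The sketch below assumes that the . inside mutate() refers to the whole data frame, so lapply visits the factor columns as well; toy is a small made-up data frame standing in for CigarettesSW, not part of the original post.

```r
# IQR rule from the question
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Hypothetical toy data: one factor column and one numeric column
toy <- data.frame(
  state = factor(c("AL", "AL", "AL", "AL", "AL")),
  packs = c(1, 2, 3, 4, 100)
)

# Works on a numeric vector: only the value 100 is flagged
numeric_flags <- is_outlier(toy$packs)

# Fails on a factor, because quantile() rejects (unordered) factors
factor_result <- tryCatch(is_outlier(toy$state), error = function(e) e)

# lapply over the whole data frame reaches the factor column first and stops
whole_df_result <- tryCatch(lapply(toy, is_outlier), error = function(e) e)
```

Both tryCatch() calls return an error condition, which matches the "factors are not allowed" messages shown above.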

Answer 1

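You don't need lapply when using mutate_at, because mutate_at already loops over the selected columns. In your attempts, . still refers to the whole data frame, factor columns included, which is why quantile fails. Just specify the function that should be applied to each selected column: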

    CigarettesSW %>% group_by(state) %>% mutate_at(3:8, funs(outlier = is_outlier(.)))

    # A tibble: 96 x 15
    # Groups:   state [48]
    #    state   year   cpi population    packs    income   tax     price     taxs population_outlier packs_outlier
    #   <fctr> <fctr> <dbl>      <dbl>    <dbl>     <dbl> <dbl>     <dbl>    <dbl>              <lgl>         <lgl>
    # 1     AL   1985 1.076    3973000 116.4863  46014968  32.5 102.18167 33.34834              FALSE         FALSE
    # 2     AR   1985 1.076    2327000 128.5346  26210736  37.0 101.47500 37.00000              FALSE         FALSE
    # 3     AZ   1985 1.076    3184000 104.5226  43956936  31.0 108.57875 36.17042              FALSE         FALSE
    # 4     CA   1985 1.076   26444000 100.3630 447102816  26.0 107.83734 32.10400              FALSE         FALSE
    # 5     CO   1985 1.076    3209000 112.9635  49466672  31.0  94.26666 31.00000              FALSE         FALSE
    # 6     CT   1985 1.076    3201000 109.2784  60063368  42.0 128.02499 51.48333              FALSE         FALSE
    # 7     DE   1985 1.076     618000 143.8511   9927301  30.0 102.49166 30.00000              FALSE         FALSE
    # 8     FL   1985 1.076   11352000 122.1811 166919248  37.0 115.29000 42.49000              FALSE         FALSE
    # 9     GA   1985 1.076    5963000 127.2346  78364336  28.0  97.02517 28.84183              FALSE         FALSE
    #10     IA   1985 1.076    2830000 113.7456  37902896  34.0 101.84200 37.91700              FALSE         FALSE
    # ... with 86 more rows, and 4 more variables: income_outlier <lgl>, tax_outlier <lgl>, price_outlier <lgl>,
    #   taxs_outlier <lgl>
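In newer dplyr releases funs() is deprecated (since 0.8.0) and mutate_at() is superseded by across() (since 1.0.0). A minimal sketch of an equivalent call, assuming dplyr >= 1.0.0; toy and its columns g, x and y are made up for illustration and stand in for CigarettesSW:

```r
library(dplyr)

# IQR rule from the question
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Hypothetical toy data: two groups of five rows each
toy <- data.frame(
  g = rep(c("a", "b"), each = 5),
  x = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14),
  y = c(5, 5, 5, 5, 5, 1, 2, 3, 4, -50)
)

# Apply is_outlier to the named columns within each group;
# .names appends "_outlier" so the original columns are kept
flagged <- toy %>%
  group_by(g) %>%
  mutate(across(c(x, y), is_outlier, .names = "{.col}_outlier")) %>%
  ungroup()
```

Selecting columns by name rather than by position avoids having to work out how positions are counted once a grouping column is involved.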

Alternatively, to apply the function to all numeric columns, use mutate_if with is.numeric as the predicate:

    CigarettesSW %>% group_by(state) %>% mutate_if(is.numeric, funs(outlier = is_outlier(.)))

    # A tibble: 96 x 16
    # Groups:   state [48]
    #    state   year   cpi population    packs    income   tax     price     taxs cpi_outlier population_outlier packs_outlier
    #   <fctr> <fctr> <dbl>      <dbl>    <dbl>     <dbl> <dbl>     <dbl>    <dbl>       <lgl>              <lgl>         <lgl>
    # 1     AL   1985 1.076    3973000 116.4863  46014968  32.5 102.18167 33.34834       FALSE              FALSE         FALSE
    # 2     AR   1985 1.076    2327000 128.5346  26210736  37.0 101.47500 37.00000       FALSE              FALSE         FALSE
    # 3     AZ   1985 1.076    3184000 104.5226  43956936  31.0 108.57875 36.17042       FALSE              FALSE         FALSE
    # 4     CA   1985 1.076   26444000 100.3630 447102816  26.0 107.83734 32.10400       FALSE              FALSE         FALSE
    # 5     CO   1985 1.076    3209000 112.9635  49466672  31.0  94.26666 31.00000       FALSE              FALSE         FALSE
    # 6     CT   1985 1.076    3201000 109.2784  60063368  42.0 128.02499 51.48333       FALSE              FALSE         FALSE
    # 7     DE   1985 1.076     618000 143.8511   9927301  30.0 102.49166 30.00000       FALSE              FALSE         FALSE
    # 8     FL   1985 1.076   11352000 122.1811 166919248  37.0 115.29000 42.49000       FALSE              FALSE         FALSE
    # 9     GA   1985 1.076    5963000 127.2346  78364336  28.0  97.02517 28.84183       FALSE              FALSE         FALSE
    #10     IA   1985 1.076    2830000 113.7456  37902896  34.0 101.84200 37.91700       FALSE              FALSE         FALSE
    # ... with 86 more rows, and 4 more variables: income_outlier <lgl>, tax_outlier <lgl>, price_outlier <lgl>,
    #   taxs_outlier <lgl>
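The question asks for a single outlier column rather than one flag per variable. One possible way to get that is to combine the per-column flags row by row with if_any(). This is a sketch that assumes dplyr >= 1.0.4; toy and its columns are again made up for illustration:

```r
library(dplyr)

# IQR rule from the question
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Hypothetical toy data: a factor grouping column and two numeric columns
toy <- data.frame(
  g = factor(rep(c("a", "b"), each = 5)),
  x = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14),
  y = c(5, 5, 5, 5, 5, 1, 2, 3, 4, -50)
)

# Within each group, flag a row if any numeric column is an outlier there
flagged <- toy %>%
  group_by(g) %>%
  mutate(outlier = if_any(where(is.numeric), is_outlier)) %>%
  ungroup()
```

Here a row is flagged when at least one numeric column is an outlier within its group; swapping if_any() for if_all() would instead require every numeric column to be an outlier.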
