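I want to identify outliers in my dataset within groups defined by some grouping variables. To do that, I would like to create (mutate) an additional column, outlier, that holds FALSE / TRUE for each row depending on whether its value is an outlier. Only the numeric variables should be included.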
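library(AER)    # provides the CigarettesSW data set
library(dplyr)  # provides %>%, group_by() and the mutate() family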
# Load Data
data("CigarettesSW")
head(CigarettesSW)
# state year cpi population packs income tax price taxs
# 1 AL 1985 1.076 3973000 116.5 46014968 32.5 102.18 33.35
# 2 AR 1985 1.076 2327000 128.5 26210736 37.0 101.47 37.00
# 3 AZ 1985 1.076 3184000 104.5 43956936 31.0 108.58 36.17
# 4 CA 1985 1.076 26444000 100.4 447102816 26.0 107.84 32.10
# 5 CO 1985 1.076 3209000 113.0 49466672 31.0 94.27 31.00
# 6 CT 1985 1.076 3201000 109.3 60063368 42.0 128.02 51.48
# Custom function
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
R> CigarettesSW %>% group_by(state) %>% mutate(outlier = lapply(., is_outlier))
Error in mutate_impl(.data, dots) : factors are not allowed
Here I tried to pass only the numeric variables:
R> CigarettesSW %>% group_by(state) %>% mutate_at(3:9, outlier = lapply(., is_outlier))
Error in quantile.default(x, 0.25) : factors are not allowed
However, this returns an error as well. I am not sure how to approach it differently.
Answer 0 (score: 3)
When you use mutate_at, you don't need lapply to loop over the columns; just specify the function that should be applied to each selected column:
CigarettesSW %>% group_by(state) %>% mutate_at(3:8, funs(outlier = is_outlier(.)))
# A tibble: 96 x 15
# Groups: state [48]
# state year cpi population packs income tax price taxs population_outlier packs_outlier
# <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
# 1 AL 1985 1.076 3973000 116.4863 46014968 32.5 102.18167 33.34834 FALSE FALSE
# 2 AR 1985 1.076 2327000 128.5346 26210736 37.0 101.47500 37.00000 FALSE FALSE
# 3 AZ 1985 1.076 3184000 104.5226 43956936 31.0 108.57875 36.17042 FALSE FALSE
# 4 CA 1985 1.076 26444000 100.3630 447102816 26.0 107.83734 32.10400 FALSE FALSE
# 5 CO 1985 1.076 3209000 112.9635 49466672 31.0 94.26666 31.00000 FALSE FALSE
# 6 CT 1985 1.076 3201000 109.2784 60063368 42.0 128.02499 51.48333 FALSE FALSE
# 7 DE 1985 1.076 618000 143.8511 9927301 30.0 102.49166 30.00000 FALSE FALSE
# 8 FL 1985 1.076 11352000 122.1811 166919248 37.0 115.29000 42.49000 FALSE FALSE
# 9 GA 1985 1.076 5963000 127.2346 78364336 28.0 97.02517 28.84183 FALSE FALSE
#10 IA 1985 1.076 2830000 113.7456 37902896 34.0 101.84200 37.91700 FALSE FALSE
# ... with 86 more rows, and 4 more variables: income_outlier <lgl>, tax_outlier <lgl>, price_outlier <lgl>,
# taxs_outlier <lgl>
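For context on the errors in the question: inside the pipe, `.` stands for the whole data frame, so `lapply(., is_outlier)` visits every column, including the factor columns state and year, and `quantile()` rejects factors. The following is a minimal sketch of that behaviour using base R only; `toy` is hypothetical data made up for illustration, not part of CigarettesSW, and the exact wording of the error message may differ between R versions.

```r
# Hypothetical toy data: one factor column and one numeric column
toy <- data.frame(
  state = factor(c("A", "A", "B", "B")),
  packs = c(10, 12, 20, 200)
)

# Same 1.5 * IQR rule as in the question
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# lapply() over a data frame visits every column, including the factor,
# so quantile() is called on a factor and signals an error
msg <- tryCatch(
  {
    lapply(toy, is_outlier)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Restricting to the numeric columns first avoids the error
numeric_cols <- toy[sapply(toy, is.numeric)]
flags <- lapply(numeric_cols, is_outlier)
print(flags$packs)
```

mutate_at and mutate_if do this column selection for you and call the function once per selected column and group, which is why no explicit lapply is needed.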
Or, to apply the function to all numeric columns, you can use mutate_if with is.numeric as the predicate:
CigarettesSW %>% group_by(state) %>% mutate_if(is.numeric, funs(outlier = is_outlier(.)))
# A tibble: 96 x 16
# Groups: state [48]
# state year cpi population packs income tax price taxs cpi_outlier population_outlier packs_outlier
# <fctr> <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
# 1 AL 1985 1.076 3973000 116.4863 46014968 32.5 102.18167 33.34834 FALSE FALSE FALSE
# 2 AR 1985 1.076 2327000 128.5346 26210736 37.0 101.47500 37.00000 FALSE FALSE FALSE
# 3 AZ 1985 1.076 3184000 104.5226 43956936 31.0 108.57875 36.17042 FALSE FALSE FALSE
# 4 CA 1985 1.076 26444000 100.3630 447102816 26.0 107.83734 32.10400 FALSE FALSE FALSE
# 5 CO 1985 1.076 3209000 112.9635 49466672 31.0 94.26666 31.00000 FALSE FALSE FALSE
# 6 CT 1985 1.076 3201000 109.2784 60063368 42.0 128.02499 51.48333 FALSE FALSE FALSE
# 7 DE 1985 1.076 618000 143.8511 9927301 30.0 102.49166 30.00000 FALSE FALSE FALSE
# 8 FL 1985 1.076 11352000 122.1811 166919248 37.0 115.29000 42.49000 FALSE FALSE FALSE
# 9 GA 1985 1.076 5963000 127.2346 78364336 28.0 97.02517 28.84183 FALSE FALSE FALSE
#10 IA 1985 1.076 2830000 113.7456 37902896 34.0 101.84200 37.91700 FALSE FALSE FALSE
# ... with 86 more rows, and 4 more variables: income_outlier <lgl>, tax_outlier <lgl>, price_outlier <lgl>,
# taxs_outlier <lgl>
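In more recent dplyr releases, funs() is deprecated (since 0.8.0) and the scoped verbs mutate_at / mutate_if are superseded by across() (since 1.0.0). The sketch below shows what an equivalent call might look like, assuming dplyr >= 1.0.0 is installed; `toy` is hypothetical data standing in for CigarettesSW, chosen so that each group contains one obvious outlier.

```r
library(dplyr)

# Same 1.5 * IQR rule as in the question
is_outlier <- function(x) {
  x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x)
}

# Hypothetical toy data: two groups of six rows each
toy <- data.frame(
  state = factor(rep(c("A", "B"), each = 6)),
  packs = c(10, 11, 12, 13, 14, 100, 20, 21, 22, 23, 24, 25),
  tax   = c(1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, -50)
)

result <- toy %>%
  group_by(state) %>%
  # Apply is_outlier to every numeric column within each group and
  # store the flags in new columns named <column>_outlier
  mutate(across(where(is.numeric), is_outlier, .names = "{.col}_outlier")) %>%
  ungroup()

print(result)
```

Here packs = 100 in group A and tax = -50 in group B are the only values flagged. Note that CigarettesSW has only two rows per state (96 rows in 48 groups), and with two observations the 1.5 * IQR fences always enclose both values, which is consistent with the all-FALSE columns in the output above.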