Is there an elegant way in dplyr to treat NA as 0 (na.rm = TRUE)?
data <- data.frame(a=c(1,2,3,4), b=c(4,NA,5,6), c=c(7,8,9,NA))
data %>% mutate(sum = a + b + c)
a b c sum
1 4 7 12
2 NA 8 NA
3 5 9 17
4 6 NA NA
but I would like to get
a b c sum
1 4 7 12
2 NA 8 10
3 5 9 17
4 6 NA 10
even though I know this is not the desired result in many other cases.
Answer 0 (score: 35)
You can use this:
library(dplyr)
data %>%
#rowwise will make sure the sum operation will occur on each row
rowwise() %>%
#then a simple sum(..., na.rm=TRUE) is enough to result in what you need
mutate(sum = sum(a,b,c, na.rm=TRUE))
Output:
Source: local data frame [4 x 4]
Groups: <by row>
a b c sum
(dbl) (dbl) (dbl) (dbl)
1 1 4 7 12
2 2 NA 8 10
3 3 5 9 17
4 4 6 NA 10
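As the "Groups: <by row>" line in the output shows, the result of rowwise() stays grouped by row. A minimal sketch, assuming the same example data as in the question, of dropping that grouping with ungroup() once the row sums are computed, so that later verbs operate on whole columns again:

```r
library(dplyr)

# Same example data as in the question
data <- data.frame(a = c(1, 2, 3, 4), b = c(4, NA, 5, 6), c = c(7, 8, 9, NA))

result <- data %>%
  rowwise() %>%                              # evaluate sum() once per row
  mutate(sum = sum(a, b, c, na.rm = TRUE)) %>%
  ungroup()                                  # remove the by-row grouping again

result$sum
```

Without the ungroup() call, a later summarise() or mutate() would still be evaluated row by row.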
Answer 1 (score: 17)
Another option:
data %>%
mutate(sum = rowSums(., na.rm = TRUE))
Benchmark
library(microbenchmark)
mbm <- microbenchmark(
steven = data %>% mutate(sum = rowSums(., na.rm = TRUE)),
lyz = data %>% rowwise() %>% mutate(sum = sum(a, b, c, na.rm=TRUE)),
nar = apply(data, 1, sum, na.rm = TRUE),
akrun = data %>% mutate_each(funs(replace(., which(is.na(.)), 0))) %>% mutate(sum=a+b+c),
frank = data %>% mutate(sum = Reduce(function(x,y) x + replace(y, is.na(y), 0), .,
init=rep(0, n()))),
times = 10)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# steven 9.493812 9.558736 18.31476 10.10280 22.55230 65.15325 10 a
# lyz 6791.690570 6836.243782 6978.29684 6915.16098 7138.67733 7321.61117 10 c
# nar 702.537055 723.256808 799.79996 805.71028 849.43815 909.36413 10 b
# akrun 11.372550 11.388473 28.49560 11.44698 20.21214 155.23165 10 a
# frank 20.206747 20.695986 32.69899 21.12998 25.11939 118.14779 10 a
Answer 2 (score: 5)
Or we can replace NA with 0 and then use the OP's code:
data %>%
mutate_each(funs(replace(., which(is.na(.)), 0))) %>%
mutate(Sum= a+b+c)
#or as @Frank mentioned
#mutate(Sum = Reduce(`+`, .))
Based on the benchmark using @StevenBeaupré's data, it also appears to be efficient.
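mutate_each() and funs() have since been deprecated in dplyr. A sketch of the same idea written with across(), assuming dplyr 1.0.0 or later and the example data from the question (this is a rewrite for illustration, not the code that was benchmarked above):

```r
library(dplyr)

# Same example data as in the question
data <- data.frame(a = c(1, 2, 3, 4), b = c(4, NA, 5, 6), c = c(7, 8, 9, NA))

result <- data %>%
  # replace NA with 0 in the listed columns
  mutate(across(c(a, b, c), ~ replace(.x, is.na(.x), 0))) %>%
  # with no NA left, plain addition works
  mutate(Sum = a + b + c)

result
```

Like the original version, this overwrites the NA values in a, b and c with 0 in the returned data frame, so it only suits cases where losing the NA markers is acceptable.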
Answer 3 (score: 1)
With the new dplyr 1.0.0, you can use c_across together with rowwise.
library(dplyr)
data %>%
rowwise() %>%
mutate(sum = sum(c_across(a:c), na.rm = TRUE))
# a b c sum
# <dbl> <dbl> <dbl> <dbl>
#1 1 4 7 12
#2 2 NA 8 10
#3 3 5 9 17
#4 4 6 NA 10
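Given how slow the rowwise() approach was in the benchmark above, a vectorized variant may be worth considering when the data is large. A sketch, assuming dplyr 1.0.0 or later, that uses across(a:c) only to select the columns and passes them to rowSums(); recent dplyr releases recommend pick() for this kind of column selection, so treat the exact spelling as an assumption about your installed version:

```r
library(dplyr)

# Same example data as in the question
data <- data.frame(a = c(1, 2, 3, 4), b = c(4, NA, 5, 6), c = c(7, 8, 9, NA))

result <- data %>%
  # across(a:c) returns the selected columns as a data frame,
  # which rowSums() sums row by row in a single vectorized call
  mutate(sum = rowSums(across(a:c), na.rm = TRUE))

result
```

This keeps the a:c range selection from the c_across() version while avoiding the per-row evaluation.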
Answer 4 (score: 0)
Try this
data$sum <- apply(data, 1, sum, na.rm = TRUE)
The resulting data is
a b c sum
1 1 4 7 12
2 2 NA 8 10
3 3 5 9 17
4 4 6 NA 10
Answer 5 (score: 0)
This is similar to Steven's approach, but includes dplyr::select() to explicitly state which columns to include or ignore (such as ID variables).
data %>%
mutate(sum = rowSums(dplyr::select(., a, b, c), na.rm = TRUE))
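It has comparable performance on a realistically sized dataset. I'm not sure why, since no columns are actually excluded in this skinny example.
A larger dataset with 1M rows:
pick <- function() { sample(c(1:5, NA), 1000000, replace=TRUE) }
data <- data.frame(a=pick(), b=pick(), c=pick())
Results: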
Unit: milliseconds
expr min lq mean median uq max neval cld
steven 22.05847 22.96164 56.84822 28.85411 54.99691 174.58447 10 a
wibeasley 25.10274 26.98303 30.66911 29.30630 30.63343 49.46048 10 a
lyz 10408.89904 10548.33756 10887.51930 10720.92372 11017.56256 12250.41370 10 c
nar 1975.35941 2011.36445 2123.81705 2090.43174 2172.80501 2362.13658 10 b
akrun 31.27247 35.41943 81.33320 57.93900 63.59119 302.21059 10 a
frank 37.48265 38.72270 65.02965 41.62735 44.45775 261.79898 10 a
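To illustrate why the explicit select() matters, here is a small sketch with a hypothetical id column added to the question's example data (the id values are made up for illustration). With rowSums(.), the id is silently included in the total; with select(), it is not:

```r
library(dplyr)

# Question's example data plus a hypothetical ID column
data <- data.frame(
  id = c(101, 102, 103, 104),
  a  = c(1, 2, 3, 4),
  b  = c(4, NA, 5, 6),
  c  = c(7, 8, 9, NA)
)

# Sums only the named columns; id is left out of the total
with_select <- data %>%
  mutate(sum = rowSums(dplyr::select(., a, b, c), na.rm = TRUE))

# Sums every column, including id
without_select <- data %>%
  mutate(sum = rowSums(., na.rm = TRUE))

with_select$sum
without_select$sum
```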