I have a data.frame like this:
df <- structure(list(sample = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), sub_sample = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L), .Label = c("A", "B", "C"), class = "factor"), value = c(111L,
233L, NA, NA, NA, 56L, 48L, 23L, 48L, 567L, 98L, 75L, 7578L,
NA, 56L, 48L, NA, NA)), class = "data.frame", row.names = c(NA,
-18L))
It contains some missing values (NA), and I want to compute the percentage of non-NA values in each group. This is how I currently do it:
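library(dplyr)
# Number of rows per group, NAs included
total_nr <- df %>%
  group_by(sample, sub_sample) %>%
  tally()
# Number of rows per group after dropping rows that contain NA
nr_wo_NA <- df %>%
  group_by(sample, sub_sample) %>%
  na.omit() %>%
  tally()
# Percentage of non-NA values per group
nr_wo_NA$n <- (nr_wo_NA$n / total_nr$n) * 100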
This gives me what I want:
# A tibble: 6 x 3
# Groups: sample [2]
sample sub_sample n
<int> <fct> <dbl>
1 1 A 66.7
2 1 B 33.3
3 1 C 100
4 2 A 100
5 2 B 66.7
6 2 C 33.3
But is there a way to do this without creating two separate data.frames?
Answer 0 (score: 2)
You can do it like this:
df %>%
  group_by(sample, sub_sample) %>%
  summarise(value_non_na = sum(!is.na(value)) / n() * 100)
sample sub_sample value_non_na
<int> <fct> <dbl>
1 1 A 66.7
2 1 B 33.3
3 1 C 100
4 2 A 100
5 2 B 66.7
6 2 C 33.3
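Answer 1 (score: 2)
We can take the mean of the logical vector returned by !is.na(value), since the mean of a logical vector is the proportion of TRUE values:
library(dplyr)
df %>%
  group_by(sample, sub_sample) %>%
  summarise(value = mean(!is.na(value)) * 100)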
# sample sub_sample value
# <int> <fct> <dbl>
#1 1 A 66.7
#2 1 B 33.3
#3 1 C 100
#4 2 A 100
#5 2 B 66.7
#6 2 C 33.3
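Both answers give the same result because R treats TRUE as 1 and FALSE as 0 in arithmetic. Here is a minimal base R sketch of that equivalence, using the three values of sample 1, sub_sample A from the question (the variable names are only illustrative):

```r
# Values of sample 1, sub_sample A, copied from the question's data
x <- c(111L, 233L, NA)

# Logical vector: TRUE where a value is present
present <- !is.na(x)

# Counting approach: number of TRUEs divided by the group size
pct_by_count <- sum(present) / length(x) * 100

# Averaging approach: the mean of a logical vector is the proportion of TRUEs
pct_by_mean <- mean(present) * 100

print(c(pct_by_count, pct_by_mean))
```

Both numbers should round to 66.7, matching the first row of the output above.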
The same logic works in base R (na.action = na.pass keeps the NA rows so they can be counted):
aggregate(value ~ sample + sub_sample, df, function(x) mean(!is.na(x)) * 100, na.action = na.pass)
and with data.table:
library(data.table)
# setDT() converts df to a data.table by reference
setDT(df)[, .(value = mean(!is.na(value)) * 100), by = .(sample, sub_sample)]
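One practical difference from the two-step approach in the question is worth noting: if a group consisted entirely of NA values, dropping the NA rows first would remove that group altogether, whereas computing the percentage on the full data reports it as 0. A small base R sketch on hypothetical data (toy, grp and the values are made up for illustration and are not from the question):

```r
# Hypothetical data: group "B" has no observed values at all
toy <- data.frame(
  grp   = c("A", "A", "B", "B"),
  value = c(1, NA, NA, NA)
)

# Percentage of non-NA values per group, computed on the full data
pct <- tapply(toy$value, toy$grp, function(x) mean(!is.na(x)) * 100)
print(pct)

# Dropping NA rows first loses group "B" entirely
kept <- na.omit(toy)
print(unique(kept$grp))
```

Here pct holds an entry for both groups, while kept only contains rows from group A, so per-group counts taken before and after na.omit() would no longer line up.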