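I want to summarize data in several columns using every row except those that have a given value in a separate grouping column. For example, in the df below, for each row I want the median of A, B, C, D and E computed only from the rows that are not assigned to that row's cluster.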
df = data.frame(cluster = c(1:5, 1:3, 1:2),
                A = rnorm(10, 2),
                B = rnorm(10, 5),
                C = rnorm(10, 0.4),
                D = rnorm(10, 3),
                E = rnorm(10, 1))
df %>%
  group_by(cluster) %>%
  summarise_at(toupper(letters[1:5]), funs(m = fun_i_need_help_with(.)))
fun_i_need_help_with would be equivalent to:
first row: median(df[which(df$cluster != 1), "A"])
second row: median(df[which(df$cluster != 2), "A"])
and so on...
I can do this with a nested for loop, but it runs slowly and does not feel like an idiomatic R solution.
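# Read from an unmodified copy so that values already replaced for one
# cluster do not leak into the medians computed for later clusters
orig <- df
for(col in toupper(letters[1:5])){
  for(clust in unique(orig$cluster)){
    df[which(orig$cluster == clust), col] <-
      median(orig[which(orig$cluster != clust), col])
  }
}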
Answer 0 (score: 2)
A solution using the tidyverse.
set.seed(123)
df = data.frame(cluster = c(1:5, 1:3, 1:2),
                A = rnorm(10, 2),
                B = rnorm(10, 5),
                C = rnorm(10, 0.4),
                D = rnorm(10, 3),
                E = rnorm(10, 1))
library(tidyverse)
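# For each cluster, drop its rows and take the median of every other column
df2 <- map_dfr(unique(df$cluster),
               ~df %>%
                 filter(cluster != .x) %>%
                 # Pass median directly; funs() is deprecated in recent dplyr
                 summarize_at(vars(-cluster), median) %>%
                 # Add a label to show the content of this row is not from a certain cluster number
                 mutate(not_cluster = .x))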
df2
# A B C D E not_cluster
# 1 2.070508 5.110683 0.1820251 3.553918 0.7920827 1
# 2 2.070508 5.400771 -0.6260044 3.688640 0.5333446 2
# 3 1.920165 5.428832 -0.2769652 3.490191 0.8543568 3
# 4 1.769823 5.400771 -0.2250393 3.426464 0.5971152 4
# 5 1.769823 5.400771 -0.3288912 3.426464 0.5971152 5
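For comparison, the same leave-one-cluster-out medians can be sketched in base R without loading any packages. This is only an illustrative alternative, not part of the answer above; the names value_cols, clusters, leave_out and per_row are made up for the example. The last step maps the per-cluster result back onto the original rows, which is what the loop in the question was doing.

```r
set.seed(123)
df <- data.frame(cluster = c(1:5, 1:3, 1:2),
                 A = rnorm(10, 2),
                 B = rnorm(10, 5),
                 C = rnorm(10, 0.4),
                 D = rnorm(10, 3),
                 E = rnorm(10, 1))

# Hypothetical helper names, chosen for this sketch only
value_cols <- setdiff(names(df), "cluster")
clusters <- sort(unique(df$cluster))

# One row per cluster: column medians over the rows NOT in that cluster
leave_out <- do.call(rbind, lapply(clusters, function(cl) {
  others <- df[df$cluster != cl, value_cols, drop = FALSE]
  data.frame(not_cluster = cl, lapply(others, median))
}))

# Map the per-cluster medians back onto every original row
per_row <- leave_out[match(df$cluster, leave_out$not_cluster), value_cols]
```

Here leave_out has the same shape as df2 (one row per cluster), while per_row has one row per row of df.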