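Summarize variables excluding the group

Date: 2020-02-10 15:16:30

Tags: r dplyr summarize

I have a data.frame and need to compute the mean for each "anti-group": for each Name below, the mean over all rows that do not belong to that Name.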

Name     Month  Rate1     Rate2
Aira       1      12        23
Aira       2      18        73
Aira       3      19        45
Ben        1      53        19
Ben        2      22        87
Ben        3      19        45
Cat        1      22        87
Cat        2      67        43
Cat        3      45        32

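My desired output looks like the following, where the Rate1 and Rate2 values are the means of the column values not found in each group. Please disregard the values; I made them up for the example. I would prefer to use dplyr if possible.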

Name    Rate1   Rate2
Aira    38      52.2
Ben     30.5    50.5
Cat     23.8    48.7

Any help is much appreciated! Thanks!

PS: Thanks to Ianthe, whose question and data I copied, with a few changes to the question. (Mean per group in a data.frame)

6 answers:

Answer 0 (score: 2)

Here is another idea, using base R:

res1 <- do.call(rbind, lapply(unique(df$Name), function(i) colMeans(df[!df$Name %in% i, -c(1:2)])))
res1

#        Rate1    Rate2
#[1,] 38.00000 52.16667
#[2,] 30.50000 50.50000
#[3,] 23.83333 48.66667

Or, with Name included,

cbind.data.frame(Name = unique(df$Name), res1)

#  Name    Rate1    Rate2
#1 Aira 38.00000 52.16667
#2  Ben 30.50000 50.50000
#3  Cat 23.83333 48.66667

Answer 1 (score: 1)

library(tidyverse)

# example dataset
df = read.table(text = "
Name     Month  Rate1     Rate2
Aira       1      12        23
Aira       2      18        73
Aira       3      19        45
Ben        1      53        19
Ben        2      22        87
Ben        3      19        45
Cat        1      22        87
Cat        2      67        43
Cat        3      45        32
", header = TRUE, stringsAsFactors = FALSE)

# function that returns means of Rates after excluding a given name
AntiGroupMean = function(x) { df %>% filter(Name != x) %>% summarise_at(vars(matches("Rate")), mean) }

df %>%
  distinct(Name) %>%                         # for each name
  mutate(v = map(Name, AntiGroupMean)) %>%   # apply the function
  unnest(v)                                  # unnest results

# # A tibble: 3 x 3
#   Name  Rate1 Rate2
#   <chr> <dbl> <dbl>
# 1 Aira   38    52.2
# 2 Ben    30.5  50.5
# 3 Cat    23.8  48.7

Answer 2 (score: 1)

One option could be:

df %>%
 mutate_at(vars(Rate1, Rate2), list(sum = ~ sum(.))) %>%
 mutate(rows = n()) %>%
 group_by(Name) %>%
 summarise(Rate1 = first((Rate1_sum - sum(Rate1))/(rows-n())),
           Rate2 = first((Rate2_sum - sum(Rate2))/(rows-n())))

  Name  Rate1 Rate2
  <chr> <dbl> <dbl>
1 Aira   38    52.2
2 Ben    30.5  50.5
3 Cat    23.8  48.7

Or in a less tidy form:

df %>%
 group_by(Name) %>%
 summarise(Rate1 = first((sum(df$Rate1) - sum(Rate1))/(nrow(df)-n())),
           Rate2 = first((sum(df$Rate2) - sum(Rate2))/(nrow(df)-n())))
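
The same "total minus group" arithmetic can be written once for every Rate column. The following is only a sketch, assuming dplyr 1.0.0 or later (for across() and cur_column()) and a data frame named df as in the question; it is not part of the answer above.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
  Month = rep(1:3, times = 3),
  Rate1 = c(12, 18, 19, 53, 22, 19, 22, 67, 45),
  Rate2 = c(23, 73, 45, 19, 87, 45, 87, 43, 32)
)

# For each Name:
#   (sum over all rows - sum within the group) /
#   (number of all rows - number of rows in the group)
res <- df %>%
  group_by(Name) %>%
  summarise(across(
    starts_with("Rate"),
    # df[[cur_column()]] is the full, ungrouped column; .x is the group's slice
    ~ (sum(df[[cur_column()]]) - sum(.x)) / (nrow(df) - n())
  ))

res
```

Because the full column is looked up in df by name, this only works when the pipeline starts from that same df.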

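Answer 3 (score: 1)

You can compute it as the mean of the group means, weighted by the number of observations in each group, but with the weight of the given row's own group set to 0.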

library(dplyr)

df %>% 
  group_by(Name) %>% 
  summarise(n = n(), Rate1 = mean(Rate1), Rate2 = mean(Rate2)) %>% 
  mutate_at(vars(starts_with('Rate')),  ~
    sapply(Name, function(x) weighted.mean(.x, n*(Name != x))))

# A tibble: 3 x 4
  Name      n Rate1 Rate2
  <chr> <int> <dbl> <dbl>
1 Aira      3  38    52.2
2 Ben       3  30.5  50.5
3 Cat       3  23.8  48.7
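
To see why the zero weight works, here is a small base R check for a single case (Rate1, excluding Aira). It is only an illustration; the group sums 49, 94 and 134 are taken from the question's data, and the variable names are made up for this sketch.

```r
# Rate1 group means and group sizes from the question's data
group_means <- c(Aira = 49 / 3, Ben = 94 / 3, Cat = 134 / 3)
group_n     <- c(Aira = 3, Ben = 3, Cat = 3)

# Give Aira's own group a weight of 0; the other groups keep their sizes
w <- group_n * (names(group_n) != "Aira")

# Weighted mean of the group means = (94 + 134) / 6
aira_rate1 <- weighted.mean(group_means, w)
aira_rate1
```

The result agrees with the Rate1 value for Aira in the table above (38, up to floating-point rounding).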

Answer 4 (score: 0)

We can use

library(dplyr)
library(purrr)
map_dfr(unique(df1$Name), ~ 
   anti_join(df1, tibble(Name = .x)) %>% 
   summarise_at(vars(starts_with('Rate')), mean) %>%
   mutate(Name = .x)) %>%
   select(Name, everything())
#    Name    Rate1    Rate2
#1 Aira 38.00000 52.16667
#2  Ben 30.50000 50.50000
#3  Cat 23.83333 48.66667

Data

df1 <- structure(list(Name = c("Aira", "Aira", "Aira", "Ben", "Ben", 
"Ben", "Cat", "Cat", "Cat"), Month = c(1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L), Rate1 = c(12L, 18L, 19L, 53L, 22L, 19L, 22L, 67L, 
45L), Rate2 = c(23L, 73L, 45L, 19L, 87L, 45L, 87L, 43L, 32L)), 
 class = "data.frame", row.names = c(NA, 
-9L))

Answer 5 (score: 0)

You could try:

library(dplyr)

df %>%
  mutate_at(
    vars(contains('Rate')),
    ~ sapply(1:n(), function(x) mean(.[Name %in% setdiff(unique(df$Name), Name[x])], na.rm = TRUE)
             )
  ) %>%
  distinct_at(vars(-Month))

Output:

  Name    Rate1    Rate2
1 Aira 38.00000 52.16667
2  Ben 30.50000 50.50000
3  Cat 23.83333 48.66667

(Though you may be better off with one of the other solutions, since a row-wise sapply will be very slow on larger datasets.)
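
For larger data, one way to avoid a per-row loop is to compute the group sums once and subtract them from the column totals. This is a rough base R sketch, not taken from any answer here; rate_cols and the other names are chosen for illustration only.

```r
# Data from the question
df <- data.frame(
  Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
  Month = rep(1:3, times = 3),
  Rate1 = c(12, 18, 19, 53, 22, 19, 22, 67, 45),
  Rate2 = c(23, 73, 45, 19, 87, 45, 87, 43, 32)
)
rate_cols <- c("Rate1", "Rate2")

# One row of column sums per Name (rows are sorted by Name)
gs <- rowsum(as.matrix(df[rate_cols]), group = df$Name)

# Group sizes, in the same order as the rows of gs
group_n <- as.vector(table(df$Name)[rownames(gs)])

# Column totals over the whole data frame
total_sums <- colSums(df[rate_cols])

# (total - group sum) / (total rows - group rows), for every group at once
anti <- sweep(-gs, 2, total_sums, "+") / (nrow(df) - group_n)

result <- data.frame(Name = rownames(gs), anti, row.names = NULL)
result
```

Each column is only summed a couple of times here, whatever the number of rows, so the cost does not grow with a per-row function call.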