Adding an overall mean when using group_by

Date: 2018-10-24 15:19:55

Tags: r dplyr

I am using the dplyr package to generate some tables, together with the adorn_totals("row") function (which comes from the janitor package).

This works well when I want to sum the values within groups, but in some cases I want an overall mean rather than a sum. Is there an adorn_means function?

Example code:

Regions2 <- Data %>%
  filter(!is.na(REGION)) %>%
  group_by(REGION) %>%
  summarise(NumberofPeople = length(Names)) %>%
  adorn_totals("row")

Here my "Total" row is simply the sum of all the people across the regions. This gives me

REGION          NumberofPeople
East Midlands       578,943
East of England     682,917
London            1,247,540
North East          245,830
North West          742,886
South East          963,040
South West          623,684
West Midlands       653,335
Yorkshire           553,853
TOTAL             6,292,028

My next piece of code produces an average salary for each region, but I would like the total row to show the overall average salary

Regions3 <- Data %>%
  filter(!is.na(REGION))%>%
  filter(!is.na(AVGSalary))%>%
  group_by(REGION) %>%
  summarise(AverageSalary=mean(AVGSalary))

If I use adorn_totals("row") as before, I just get the sum of the averages, not the overall average of the dataset.

How can I get the overall average?

Update with some dummy data:

Data

people  region      salary
person1 London      1000
person2 South West  1050
person3 South East  900
person4 London      800
person5 Scotland    1020
person6 South West  750
person7 East        600
person8 London      1200
person9 South West  1150

So the group averages are:

London      1000
South West  983.33
South East  900
Scotland    1020
East        600

I would like to add the total at the bottom

Total    941.11

2 answers:

Answer 0: (score: 1)

One option is to add the row with bind_rows

library(dplyr)
Data %>% 
   group_by(region) %>% 
   summarise(Avgsalary = mean(salary)) %>%
   # tibble() replaces the deprecated data_frame()
   bind_rows(tibble(region = 'Total',
                    Avgsalary = mean(.$Avgsalary, na.rm = TRUE)))

Or another option is add_row from tibble

Data %>% 
   group_by(region) %>% 
   summarise(Avgsalary = mean(salary)) %>% 
   add_row(region = 'Total', Avgsalary = mean(.$Avgsalary))

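If the total should instead be the overall mean of the raw values, taken before the group-wise mean, then we need to compute it first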

Data %>%  
  mutate(Total = mean(salary)) %>% 
  group_by(region) %>%
  summarise(Avgsummary = mean(salary), Total = first(Total)) %>% 
  add_row(region = 'Total', Avgsummary = .$Total[1]) %>% 
  select(-Total)
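As a rough check of which figure each variant produces, the sketch below rebuilds the question's sample data (the name Data and the column names are taken from the question; the tibble construction itself is an assumption about how the data is stored) and compares the mean of the group means with the mean over all rows. The first two snippets above put the former in the Total row, the third puts the latter.

```r
library(dplyr)
library(tibble)

# Sample data from the question, rebuilt by hand (assumed layout)
Data <- tibble(
  people = paste0("person", 1:9),
  region = c("London", "South West", "South East", "London", "Scotland",
             "South West", "East", "London", "South West"),
  salary = c(1000, 1050, 900, 800, 1020, 750, 600, 1200, 1150)
)

by_region <- Data %>%
  group_by(region) %>%
  summarise(Avgsalary = mean(salary))

# Mean of the five group means: every region counts once
mean_of_means <- mean(by_region$Avgsalary)

# Mean over all nine people: every person counts once
overall_mean <- mean(Data$salary)

round(c(mean_of_means = mean_of_means, overall_mean = overall_mean), 2)
# mean_of_means 900.67, overall_mean 941.11
```

The question asks for 941.11, so on this data the third variant is the one that matches.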

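Answer 1: (score: 1)

1) Because the overall mean is a weighted mean of the group means (not a plain mean of the means), i.e. 941 rather than 901, we keep an n column so that the overall mean can be computed correctly at the end. Although the data shown has no NAs, we also use drop_na in case the real data does. It removes every row that contains an NA.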

library(dplyr)
library(tidyr)

Region %>%
  drop_na %>%
  group_by(region) %>%
  summarize(avg = mean(salary), n = n()) %>%
  ungroup %>%
  bind_rows(summarize(., region = "Overall Avg", 
                         avg = sum(avg * n) / sum(n), 
                         n = sum(n))) %>%
  select(-n)

giving:

# A tibble: 6 x 2
  region        avg
  <chr>       <dbl>
1 East         600 
2 London      1000 
3 Scotland    1020 
4 South East   900 
5 South West   983.
6 Overall Avg  941.
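The expression sum(avg * n) / sum(n) used in 1) is the usual weighted mean, with the group sizes as weights. A minimal base R sketch, with the group means and counts typed in by hand from the sample data (the vectors avg and n are illustrative, not part of the answer's pipeline):

```r
# Group means and group sizes from the sample data, entered by hand
avg <- c(East = 600, London = 1000, Scotland = 1020,
         `South East` = 900, `South West` = 2950 / 3)
n <- c(1, 3, 1, 1, 3)

# The formula used in the pipeline
by_hand <- sum(avg * n) / sum(n)

# The same quantity via stats::weighted.mean()
built_in <- weighted.mean(avg, n)

all.equal(by_hand, built_in)  # TRUE
round(by_hand, 2)             # 941.11
```

Either form could be used inside summarize(); the answer's explicit formula has the advantage of showing the weights at a glance.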

2) Another approach is to build the overall average row by going back to the original data:

Region %>%
  drop_na %>%
  group_by(region) %>%
  summarize(avg = mean(salary)) %>%
  ungroup %>%
  bind_rows(summarize(Region %>% drop_na, region = "Overall Avg", avg = mean(salary)))

giving:

# A tibble: 6 x 2
  region        avg
  <chr>       <dbl>
1 East         600 
2 London      1000 
3 Scotland    1020 
4 South East   900 
5 South West   983.
6 Overall Avg  941.

2a) If you object to referring to Region twice, try this.

Region_ <- Region %>% 
  drop_na

Region_ %>%
  group_by(region) %>%
  summarize(avg = mean(salary)) %>%
  ungroup %>%
  bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))

2b) Or as a single pipeline, in which Region_ is now local to the pipeline and is removed automatically once the pipeline completes:

Region %>%
  drop_na %>%
  { Region_ <- .
    Region_ %>%
      group_by(region) %>%
      summarize(avg = mean(salary)) %>%
      ungroup %>%
      bind_rows(summarize(Region_, region = "Overall Avg", avg = mean(salary)))
  }

Note

We used this as the input

Lines <- "people  region      salary
person1 London      1000
person2 South West  1050
person3 South East  900
person4 London      800
person5 Scotland    1020
person6 South West  750
person7 East        600
person8 London      1200
person9 South West  1150"

library(gsubfn)
Region <- read.pattern(text = Lines, pattern = "^(\\S+) +(.*) (\\d+)$", 
  as.is = TRUE, skip = 1, strip.white = TRUE,
  col.names = read.table(text = Lines, nrow = 1, as.is = TRUE))
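If gsubfn is not installed, the same input can be typed in directly. This is only a sketch that assumes the three columns shown above (people, region, salary) are all that is needed; it is intended to stand in for the read.pattern result, not to reproduce it exactly.

```r
# Build the sample input by hand instead of parsing Lines with read.pattern()
Region <- data.frame(
  people = paste0("person", 1:9),
  region = c("London", "South West", "South East", "London", "Scotland",
             "South West", "East", "London", "South West"),
  salary = c(1000, 1050, 900, 800, 1020, 750, 600, 1200, 1150),
  stringsAsFactors = FALSE  # keep region as character, as as.is = TRUE does
)

str(Region)
```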