Adding a new grouping variable in dplyr

Time: 2018-09-24 16:27:41

Tags: r dplyr

# A tibble: 42 x 5
   Effective_Date Gender Location     n  freq
   <date>         <chr>  <chr>    <int> <dbl>
 1 2017-01-01     Female India      281 0.351
 2 2017-01-01     Female US        2446 0.542
 3 2017-02-01     Female India      285 0.349
 4 2017-02-01     Female US        2494 0.543
 5 2017-03-01     Female India      293 0.353
 6 2017-03-01     Female US        2494 0.542
 7 2017-04-01     Female India      292 0.350
 8 2017-04-01     Female US        2475 0.542
 9 2017-05-01     Female India      272 0.337
10 2017-05-01     Female US        2493 0.540

Given the table above, I want to add one extra row for each Effective_Date that holds the mean of freq. How would I do that? I have tried:

tbl %>% 
  group_by(Effective_Date) %>% 
  mutate(Gender = 'Female',Location='All',freq_all = mean(freq)) %>% 
  bind_rows(female,.) %>% 
  ungroup() %>% 
  arrange(Effective_Date)

But this gives me a lot of duplicated rows.

The ideal result should look like this:

 # A tibble: 42 x 5
       Effective_Date Gender Location     n  freq
       <date>         <chr>  <chr>    <int> <dbl>
     1 2017-01-01     Female India      281 0.351
     2 2017-01-01     Female US        2446 0.542
     3 2017-01-01     Female All         NA 0.447
     4 etc etc etc etc

2 answers:

Answer 0 (score: 2)

This will work for the specific example you provided:

df = read.table(text = "
Effective_Date Gender Location     n  freq
1 2017-01-01     Female India      281 0.351
2 2017-01-01     Female US        2446 0.542
3 2017-02-01     Female India      285 0.349
4 2017-02-01     Female US        2494 0.543
", header=T)

library(dplyr)

df %>%
  group_by(Effective_Date) %>%
  summarise(freq = mean(freq)) %>%
  mutate(Gender = "Female",
         Location = "all",
         n = NA) %>%
  bind_rows(df) %>%
  arrange(Effective_Date)

# # A tibble: 6 x 5
#   Effective_Date Gender Location     n  freq
#   <fct>          <chr>  <chr>    <int> <dbl>
# 1 2017-01-01     Female all         NA 0.446
# 2 2017-01-01     Female India      281 0.351
# 3 2017-01-01     Female US        2446 0.542
# 4 2017-02-01     Female all         NA 0.446
# 5 2017-02-01     Female India      285 0.349
# 6 2017-02-01     Female US        2494 0.543
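
The duplicated rows in the question most likely come from mutate(), which keeps one output row per input row, whereas summarise() collapses each group to a single row. A minimal sketch of the difference follows; the toy data frame is made up for illustration and only mirrors the first four rows of the question.

```r
library(dplyr)

# Hypothetical toy data mirroring the first four rows of the question
toy <- data.frame(
  Effective_Date = c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"),
  Gender         = "Female",
  Location       = c("India", "US", "India", "US"),
  n              = c(281L, 2446L, 285L, 2494L),
  freq           = c(0.351, 0.542, 0.349, 0.543),
  stringsAsFactors = FALSE
)

# mutate() keeps every row: the group mean is repeated on each row
with_mutate <- toy %>%
  group_by(Effective_Date) %>%
  mutate(freq_all = mean(freq)) %>%
  ungroup()

# summarise() returns exactly one row per group
with_summarise <- toy %>%
  group_by(Effective_Date) %>%
  summarise(freq = mean(freq))

nrow(with_mutate)     # same as nrow(toy)
nrow(with_summarise)  # number of distinct Effective_Date values
```

Binding the mutate() result back onto the original data therefore doubles every row, while binding the summarise() result adds only one row per date.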

This also works in the more general case, where the Gender column contains both Female and Male:

df = read.table(text = "
Effective_Date Gender Location     n  freq
1 2017-01-01     Female India      281 0.351
2 2017-01-01     Female US        2446 0.542
3 2017-02-01     Female India      285 0.349
4 2017-02-01     Female US        2494 0.543
5 2017-01-01     Male India      556 0.386
6 2017-01-01     Male US        1123 0.668
7 2017-02-01     Male India      449 0.389
8 2017-02-01     Male US        2237 0.511
", header=T)

library(dplyr)

df %>%
  group_by(Effective_Date, Gender) %>%
  summarise(freq = mean(freq)) %>%
  ungroup() %>%
  mutate(Location = "all",
         n = NA) %>%
  bind_rows(df) %>%
  arrange(Effective_Date, Gender) 

# # A tibble: 12 x 5
#   Effective_Date Gender  freq Location     n
#   <fct>          <fct>  <dbl> <chr>    <int>
# 1 2017-01-01     Female 0.446 all         NA
# 2 2017-01-01     Female 0.351 India      281
# 3 2017-01-01     Female 0.542 US        2446
# 4 2017-01-01     Male   0.527 all         NA
# 5 2017-01-01     Male   0.386 India      556
# 6 2017-01-01     Male   0.668 US        1123
# 7 2017-02-01     Female 0.446 all         NA
# 8 2017-02-01     Female 0.349 India      285
# 9 2017-02-01     Female 0.543 US        2494
#10 2017-02-01     Male   0.45  all         NA
#11 2017-02-01     Male   0.389 India      449
#12 2017-02-01     Male   0.511 US        2237
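
In the output above the summary row sorts first within each group, while the desired result in the question puts it last. One possible way to get that order, sketched here on a made-up toy data frame, is to sort on a logical expression as the final key. The label "All" and the name toy are assumptions chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data mirroring the first four rows of the question
toy <- data.frame(
  Effective_Date = c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"),
  Gender         = "Female",
  Location       = c("India", "US", "India", "US"),
  n              = c(281L, 2446L, 285L, 2494L),
  freq           = c(0.351, 0.542, 0.349, 0.543),
  stringsAsFactors = FALSE
)

result <- toy %>%
  group_by(Effective_Date, Gender) %>%
  summarise(freq = mean(freq)) %>%
  ungroup() %>%
  # NA_integer_ keeps n an integer column after binding
  mutate(Location = "All", n = NA_integer_) %>%
  # original rows first, summary rows second; columns are matched by name
  bind_rows(toy, .) %>%
  # FALSE sorts before TRUE, so the "All" row lands after the real locations
  arrange(Effective_Date, Gender, Location == "All")

result
```

Sorting on Location == "All" avoids relying on alphabetical order, which would place "All" before "India".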

Answer 1 (score: 2)

data.table has a function for this:

library(data.table)
setDT(df)

res = groupingsets(df, by=c("Effective_Date", "Gender", "Location"), 
  sets=list(
    c("Effective_Date", "Gender"), 
    c("Effective_Date", "Gender", "Location")
  ), j = .(n = sum(n), freq = mean(freq))
)[order(Effective_Date, Gender, Location, na.last=TRUE)]

   Effective_Date Gender Location    n   freq
1:     2017-01-01 Female    India  281 0.3510
2:     2017-01-01 Female       US 2446 0.5420
3:     2017-01-01 Female     <NA> 2727 0.4465
4:     2017-02-01 Female    India  285 0.3490
5:     2017-02-01 Female       US 2494 0.5430
6:     2017-02-01 Female     <NA> 2779 0.4460

So the grouping is done at two levels, one of which leaves out Location. To show "All" instead of NA, use res[is.na(Location), Location := "All"][]

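(It seems that something like weighted.mean(freq, n) should be used here instead of mean(freq). This version also computes the count n for every row, including the summary rows, since leaving it out looks odd and working around it would be tedious.)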
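
The weighted variant mentioned in the note could look roughly like this. It is only a sketch: the toy table is made up to mirror the first four rows of the question, and whether n is the right weight depends on how freq was computed, which the question does not say.

```r
library(data.table)

# Hypothetical toy data mirroring the first four rows of the question
toy <- data.table(
  Effective_Date = c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"),
  Gender         = "Female",
  Location       = c("India", "US", "India", "US"),
  n              = c(281L, 2446L, 285L, 2494L),
  freq           = c(0.351, 0.542, 0.349, 0.543)
)

myby <- c("Effective_Date", "Gender", "Location")

# Same grouping sets as in the answer, but freq is averaged with n as weight
# (assumption: n is an appropriate weight for freq)
res_w <- groupingsets(
  toy,
  j    = .(n = sum(n), freq = weighted.mean(freq, n)),
  by   = myby,
  sets = list(myby, head(myby, -1))
)

# Subtotal rows carry NA in Location; relabel them
res_w[is.na(Location), Location := "All"]

res_w
```

For the single-row groups the weighted mean is just freq itself, so only the subtotal rows differ from the unweighted version.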

A somewhat shorter version:

myby = c("Effective_Date", "Gender", "Location")
groupingsets(df, 
  j = .(n = sum(n), freq = mean(freq)), 
  by=myby, sets=list(myby, head(myby, -1))
)[, setorderv(.SD, myby, na.last=TRUE)]