# A tibble: 42 x 5
Effective_Date Gender Location n freq
<date> <chr> <chr> <int> <dbl>
1 2017-01-01 Female India 281 0.351
2 2017-01-01 Female US 2446 0.542
3 2017-02-01 Female India 285 0.349
4 2017-02-01 Female US 2494 0.543
5 2017-03-01 Female India 293 0.353
6 2017-03-01 Female US 2494 0.542
7 2017-04-01 Female India 292 0.350
8 2017-04-01 Female US 2475 0.542
9 2017-05-01 Female India 272 0.337
10 2017-05-01 Female US 2493 0.540
Given the table above, I want to add a row for each Effective_Date that holds the mean of freq. How would I do that? I have tried:
tbl %>%
group_by(Effective_Date) %>%
mutate(Gender = 'Female',Location='All',freq_all = mean(freq)) %>%
bind_rows(female,.) %>%
ungroup() %>%
arrange(Effective_Date)
But this gives me a lot of duplicated rows.
The ideal result would look like this:
# A tibble: 42 x 5
Effective_Date Gender Location n freq
<date> <chr> <chr> <int> <dbl>
1 2017-01-01 Female India 281 0.351
2 2017-01-01 Female US 2446 0.542
3 2017-01-01 Female All NA 0.447
4 etc etc etc etc
Answer 0 (score: 2)
This will work for the specific example you provided:
df = read.table(text = "
Effective_Date Gender Location n freq
1 2017-01-01 Female India 281 0.351
2 2017-01-01 Female US 2446 0.542
3 2017-02-01 Female India 285 0.349
4 2017-02-01 Female US 2494 0.543
", header=T)
library(dplyr)
df %>%
group_by(Effective_Date) %>%
summarise(freq = mean(freq)) %>%
mutate(Gender = "Female",
Location = "all",
n = NA) %>%
bind_rows(df) %>%
arrange(Effective_Date)
# # A tibble: 6 x 5
# Effective_Date Gender Location n freq
# <fct> <chr> <chr> <int> <dbl>
# 1 2017-01-01 Female all NA 0.446
# 2 2017-01-01 Female India 281 0.351
# 3 2017-01-01 Female US 2446 0.542
# 4 2017-02-01 Female all NA 0.446
# 5 2017-02-01 Female India 285 0.349
# 6 2017-02-01 Female US 2494 0.543
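The duplicates in the question most likely come from using mutate(), which keeps one row per input row, where summarise() collapses each group to a single row. A minimal sketch of that difference, assuming the same four sample rows (the names with_mutate and with_summarise are only illustrative):

```r
library(dplyr)

# Sample rows copied from the example above
df <- data.frame(
  Effective_Date = c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"),
  Gender         = "Female",
  Location       = c("India", "US", "India", "US"),
  n              = c(281L, 2446L, 285L, 2494L),
  freq           = c(0.351, 0.542, 0.349, 0.543),
  stringsAsFactors = FALSE
)

# mutate() overwrites values but keeps every row of every group
with_mutate <- df %>%
  group_by(Effective_Date) %>%
  mutate(Location = "All", freq = mean(freq)) %>%
  ungroup()

# summarise() returns exactly one row per group
with_summarise <- df %>%
  group_by(Effective_Date) %>%
  summarise(freq = mean(freq))

nrow(with_mutate)    # 4, same as the input
nrow(with_summarise) # 2, one per Effective_Date
```

Binding the mutate() result back onto the original therefore adds one "All" row per original row, which would explain the repeated rows.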
The following works in the more general case, where the Gender column contains both Female and Male:
df = read.table(text = "
Effective_Date Gender Location n freq
1 2017-01-01 Female India 281 0.351
2 2017-01-01 Female US 2446 0.542
3 2017-02-01 Female India 285 0.349
4 2017-02-01 Female US 2494 0.543
5 2017-01-01 Male India 556 0.386
6 2017-01-01 Male US 1123 0.668
7 2017-02-01 Male India 449 0.389
8 2017-02-01 Male US 2237 0.511
", header=T)
library(dplyr)
df %>%
group_by(Effective_Date, Gender) %>%
summarise(freq = mean(freq)) %>%
ungroup() %>%
mutate(Location = "all",
n = NA) %>%
bind_rows(df) %>%
arrange(Effective_Date, Gender)
# # A tibble: 12 x 5
# Effective_Date Gender freq Location n
# <fct> <fct> <dbl> <chr> <int>
# 1 2017-01-01 Female 0.446 all NA
# 2 2017-01-01 Female 0.351 India 281
# 3 2017-01-01 Female 0.542 US 2446
# 4 2017-01-01 Male 0.527 all NA
# 5 2017-01-01 Male 0.386 India 556
# 6 2017-01-01 Male 0.668 US 1123
# 7 2017-02-01 Female 0.446 all NA
# 8 2017-02-01 Female 0.349 India 285
# 9 2017-02-01 Female 0.543 US 2494
#10 2017-02-01 Male 0.45 all NA
#11 2017-02-01 Male 0.389 India 449
#12 2017-02-01 Male 0.511 US 2237
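In the outputs above the summary row comes first within each group, while the desired result in the question has it last and labelled "All". One possible way to get that order is to sort on an explicit flag. This is only a sketch: the name summary_rows is illustrative, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Sample rows copied from the first example (Female only)
df <- data.frame(
  Effective_Date = c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"),
  Gender         = "Female",
  Location       = c("India", "US", "India", "US"),
  n              = c(281L, 2446L, 285L, 2494L),
  freq           = c(0.351, 0.542, 0.349, 0.543),
  stringsAsFactors = FALSE
)

# One summary row per date and gender
summary_rows <- df %>%
  group_by(Effective_Date, Gender) %>%
  summarise(freq = mean(freq), .groups = "drop") %>%
  mutate(Location = "All", n = NA_integer_)

# Location == "All" is FALSE for the detail rows and TRUE for the summary
# row, and FALSE sorts before TRUE, so the summary row ends up last
result <- bind_rows(df, summary_rows) %>%
  arrange(Effective_Date, Gender, Location == "All")

result
```

Sorting on the flag avoids relying on the order in which the two tables were bound together.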
Answer 1 (score: 2)
data.table has a function for this:
library(data.table)
setDT(df)
res = groupingsets(df, by=c("Effective_Date", "Gender", "Location"),
sets=list(
c("Effective_Date", "Gender"),
c("Effective_Date", "Gender", "Location")
), j = .(n = sum(n), freq = mean(freq))
)[order(Effective_Date, Gender, Location, na.last=TRUE)]
Effective_Date Gender Location n freq
1: 2017-01-01 Female India 281 0.3510
2: 2017-01-01 Female US 2446 0.5420
3: 2017-01-01 Female <NA> 2727 0.4465
4: 2017-02-01 Female India 285 0.3490
5: 2017-02-01 Female US 2494 0.5430
6: 2017-02-01 Female <NA> 2779 0.4460
So you are grouping at two levels, the second of which leaves out Location. If you want to show "All" instead of NA, use res[is.na(Location), Location := "All"][].
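(It seems like something such as weighted.mean(freq, n) should be used here instead of mean(freq). This also includes the count n for the "All" row, since leaving it out seemed odd and would have been more tedious.)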
A slightly shorter version:
myby = c("Effective_Date", "Gender", "Location")
groupingsets(df,
j = .(n = sum(n), freq = mean(freq)),
by=myby, sets=list(myby, head(myby, -1))
)[, setorderv(.SD, myby, na.last=TRUE)]
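For reference, the weighted.mean(freq, n) variant mentioned above might look like the following. It is a sketch under assumptions: the sample rows are rebuilt by hand with Effective_Date as plain character, and whether n is the right weight depends on how freq was computed in the first place.

```r
library(data.table)

# Sample rows copied from the question (Female only)
df <- data.table(
  Effective_Date = c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"),
  Gender         = "Female",
  Location       = c("India", "US", "India", "US"),
  n              = c(281L, 2446L, 285L, 2494L),
  freq           = c(0.351, 0.542, 0.349, 0.543)
)

myby <- c("Effective_Date", "Gender", "Location")

# Inside j, n refers to the original column, so the weights are the
# per-location counts rather than the summed total
res <- groupingsets(
  df,
  j    = .(n = sum(n), freq = weighted.mean(freq, n)),
  by   = myby,
  sets = list(myby, head(myby, -1))
)[order(Effective_Date, Gender, Location, na.last = TRUE)]

# Label the aggregated rows
res[is.na(Location), Location := "All"]

res
```

For the per-location rows the weighted mean of a single value is that value, so only the "All" rows change compared with mean(freq).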