Suppose I have the following tibble in R:
activation_date | country | campaign | revenue | users
======================================================
1 | 1 | 1 | R_1 | U_1
2 | 1 | 1 | R_2 | U_2
3 | 1 | 1 | R_3 | U_3
1 | 1 | 2 | R_4 | U_4
2 | 1 | 2 | R_5 | U_5
3 | 1 | 2 | R_6 | U_6
1 | 2 | 3 | R_7 | U_7
2 | 2 | 3 | R_8 | U_8
3 | 2 | 3 | R_9 | U_9
I would like to group this tibble by country and summarise its data so that the output is this tibble:
country | campaign | ltv
==========================
1 | 1 | ltv_1
1 | 2 | ltv_2
2 | 3 | ltv_3
However, I want ltv_1 and ltv_2 to both be calculated using R_1 to R_6 and U_1 to U_6 together, while ltv_3 is calculated using R_7 to R_9 and U_7 to U_9.
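I cannot group_by "country" and then summarise, because that drops the "campaign" column, which I want to keep. But I also cannot group_by both "country" and "campaign", because then I could not use the first three rows to help calculate ltv_2, nor the next three rows to help calculate ltv_1.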
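One possible way to achieve this is to group by "country" and use the group_modify function to generate the grouped output tibble. However, that function is marked "experimental", so I would rather not rely on it too much. Is there another established way to do this?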
An example input tibble, test_tibble, prints as:
# A tibble: 9 x 5
activation_date country campaign revenue users
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 11
2 2 1 1 2 12
3 3 1 1 3 13
4 1 1 2 4 14
5 2 1 2 5 15
6 3 1 2 6 16
7 1 2 3 7 17
8 2 2 3 8 18
9 3 2 3 9 19
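The definition of test_tibble is not shown, but the printout above is enough to rebuild it. A minimal sketch, assuming every column is a plain double (as the <dbl> headers suggest) and that the tibble package is installed:

```r
library(tibble)

# Reconstructed from the printed rows; the asker's original definition is not shown.
test_tibble <- tibble(
  activation_date = rep(c(1, 2, 3), times = 3),     # 1, 2, 3 repeated per campaign
  country         = rep(c(1, 2), times = c(6, 3)),  # six rows for country 1, three for country 2
  campaign        = rep(c(1, 2, 3), each = 3),      # three rows per campaign
  revenue         = as.numeric(1:9),
  users           = as.numeric(11:19)
)

test_tibble
```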
The output I currently generate with the group_modify function, through a helper called test_function, is:
# A tibble: 3 x 3
country campaign ltv
<dbl> <dbl> <dbl>
1 1 1 0.213
2 1 2 0.296
3 2 3 0.444
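The group_modify code and the body of test_function are not shown in the question. The sketch below is a hypothetical stand-in, not the asker's code: it assumes ltv is the mean of the country-wide revenue/users ratio and the per-campaign ratio, which is the formula that reproduces the values printed in this thread. The argument order follows the test_function call that appears in the answer.

```r
library(dplyr)
library(tibble)

# Same data as the printed test_tibble (reconstructed, see assumption above).
test_tibble <- tibble(
  activation_date = rep(c(1, 2, 3), times = 3),
  country         = rep(c(1, 2), times = c(6, 3)),
  campaign        = rep(c(1, 2, 3), each = 3),
  revenue         = as.numeric(1:9),
  users           = as.numeric(11:19)
)

# Hypothetical stand-in for the asker's test_function.
# It receives all rows of one country and returns one row per campaign.
test_function <- function(activation_date, campaign, revenue, users) {
  total_ltv <- sum(revenue) / sum(users)  # ratio over every row of the country
  tibble(campaign = campaign, revenue = revenue, users = users) %>%
    group_by(campaign) %>%
    summarise(ltv = (sum(revenue) / sum(users) + total_ltv) / 2)
}

result <- test_tibble %>%
  group_by(country) %>%
  # .x is the slice of rows for one country, without the grouping column
  group_modify(~ test_function(.x$activation_date, .x$campaign, .x$revenue, .x$users)) %>%
  ungroup()

result
```

Because the function sees a whole country at once, it can mix country-level and campaign-level sums, which is exactly what a plain group_by(country, campaign) followed by summarise cannot do.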
Answer 0 (score: 1)
A somewhat verbose but transparent way to do this is with joins. Then again, it is not that verbose compared with the code in test_function.
library(dplyr)

test_tibble %>%
  # per-campaign ratio
  group_by(country, campaign) %>%
  summarise(campaign_ltv = sum(revenue) / sum(users)) %>%
  inner_join(
    # per-country ratio, computed from all of the country's rows
    test_tibble %>%
      group_by(country) %>%
      summarise(total_ltv = sum(revenue) / sum(users)),
    by = "country"
  ) %>%
  mutate(ltv = (total_ltv + campaign_ltv) / 2) %>%
  ungroup()
# A tibble: 3 x 5
country campaign campaign_ltv total_ltv ltv
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.167 0.259 0.213
2 1 2 0.333 0.259 0.296
3 2 3 0.444 0.444 0.444
Alternatively, you can reuse your test_function: wrap its output in a list to get a nested tibble, then call unnest.
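library(tidyr)  # for unnest()

test_tibble %>%
  group_by(country) %>%
  mutate(
    # one nested tibble per country, repeated on every row of the group
    ltv = list(test_function(activation_date, campaign, revenue, users))
  ) %>%
  select(country, ltv) %>%
  filter(row_number() == 1) %>%  # keep a single copy per country
  unnest(ltv) %>%
  ungroup()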
# A tibble: 3 x 3
country campaign ltv
<dbl> <dbl> <dbl>
1 1 1 0.213
2 1 2 0.296
3 2 3 0.444
Another option is tidyr::complete, shown here on the symbolic tibble from the question (df):
df %>%
  group_by(country) %>%
  # within each country, pair every campaign with every (revenue, users) row
  tidyr::complete(campaign, nesting(revenue, users)) %>%
  group_by(country, campaign)
# now you have all revenue and users for each country-campaign
# for total_ltv: use revenue and users as is
# for campaign_ltv: use revenue and users where activation_date is not NA
# A tibble: 15 x 5
# Groups: country, campaign [3]
country campaign revenue users activation_date
<int> <int> <chr> <chr> <int>
1 1 1 R_1 U_1 1
2 1 1 R_2 U_2 2
3 1 1 R_3 U_3 3
4 1 1 R_4 U_4 NA
5 1 1 R_5 U_5 NA
6 1 1 R_6 U_6 NA
7 1 2 R_1 U_1 NA
8 1 2 R_2 U_2 NA
9 1 2 R_3 U_3 NA
10 1 2 R_4 U_4 1
11 1 2 R_5 U_5 2
12 1 2 R_6 U_6 3
13 2 3 R_7 U_7 1
14 2 3 R_8 U_8 2
15 2 3 R_9 U_9 3
Demo with test_tibble:
test_tibble %>%
  group_by(country) %>%
  tidyr::complete(campaign, nesting(revenue, users)) %>%
  group_by(country, campaign) %>%
  summarise(
    # country-wide ratio (all rows) averaged with the campaign's own ratio
    ltv = sum(revenue) / sum(users) / 2 +
      sum(revenue[!is.na(activation_date)]) / sum(users[!is.na(activation_date)]) / 2
  ) %>%
  ungroup()
# A tibble: 3 x 3
country campaign ltv
<dbl> <dbl> <dbl>
1 1 1 0.213
2 1 2 0.296
3 2 3 0.444
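For comparison, the same numbers can also be reached without a join or complete, by attaching the country-wide ratio with a grouped mutate before summarising per campaign. This is a sketch of a swapped-in technique rather than part of the answer above. It rebuilds test_tibble from the question's printout and assumes ltv is the mean of the two ratios, as in the join version.

```r
library(dplyr)
library(tibble)

# Same data as the printed test_tibble (reconstructed from the question's output).
test_tibble <- tibble(
  activation_date = rep(c(1, 2, 3), times = 3),
  country         = rep(c(1, 2), times = c(6, 3)),
  campaign        = rep(c(1, 2, 3), each = 3),
  revenue         = as.numeric(1:9),
  users           = as.numeric(11:19)
)

result <- test_tibble %>%
  group_by(country) %>%
  # country-wide ratio, copied onto every row of that country
  mutate(total_ltv = sum(revenue) / sum(users)) %>%
  group_by(country, campaign) %>%
  # total_ltv is constant within a campaign, so first() just picks it up
  summarise(ltv = (sum(revenue) / sum(users) + first(total_ltv)) / 2) %>%
  ungroup()

result
```

The grouped mutate keeps every row, so the "campaign" column survives until the second group_by, and only stable dplyr verbs are involved.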