Summarise a tibble into multi-row output

Time: 2019-08-13 18:16:05

Tags: r dplyr tibble

Suppose I have the following tibble in R:

activation_date | country | campaign | revenue | users
======================================================
1               | 1       | 1        | R_1     | U_1
2               | 1       | 1        | R_2     | U_2
3               | 1       | 1        | R_3     | U_3
1               | 1       | 2        | R_4     | U_4
2               | 1       | 2        | R_5     | U_5
3               | 1       | 2        | R_6     | U_6
1               | 2       | 3        | R_7     | U_7
2               | 2       | 3        | R_8     | U_8
3               | 2       | 3        | R_9     | U_9

I want to group this tibble by country and summarise its data so that I get this tibble as output:

country | campaign | ltv
==========================
1       | 1        | ltv_1
1       | 2        | ltv_2
2       | 3        | ltv_3

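However, I want ltv_1 and ltv_2 to both be calculated using R_1 to R_6 and U_1 to U_6 together, and ltv_3 to be calculated using R_7 to R_9 and U_7 to U_9.

I can't group_by "country" and summarise, because that drops the "campaign" column, which I want to keep. But I can't group_by both "country" and "campaign" either, because then I couldn't use the first three rows to help calculate ltv_2, or the next three rows to help calculate ltv_1.

One possible way to achieve this is to group by "country" and use the group_modify function to generate the output tibble for each group. However, that function is at the "experimental" lifecycle stage, so I don't want to rely on it too heavily. Is there another, established way to do this?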


An example input tibble, test_tibble, prints as:

# A tibble: 9 x 5
  activation_date country campaign revenue users
            <dbl>   <dbl>    <dbl>   <dbl> <dbl>
1               1       1        1       1    11
2               2       1        1       2    12
3               3       1        1       3    13
4               1       1        2       4    14
5               2       1        2       5    15
6               3       1        2       6    16
7               1       2        3       7    17
8               2       2        3       8    18
9               3       2        3       9    19
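
The printed tibble can be rebuilt with a call along the following lines. This is a sketch: the column values and the all-double column types are read off the printout, but the construction call itself is an assumption.

```r
library(tibble)

# Rebuilt from the printed output; every column is stored as double
# to match the <dbl> headers in the printout.
test_tibble <- tibble(
  activation_date = rep(c(1, 2, 3), times = 3),
  country         = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  campaign        = rep(c(1, 2, 3), each = 3),
  revenue         = as.numeric(1:9),
  users           = as.numeric(11:19)
)

print(test_tibble)
```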

Using the group_modify function, my code generates this output:

# A tibble: 3 x 3
  country campaign   ltv
    <dbl>    <dbl> <dbl>
1       1        1 0.213
2       1        2 0.296
3       2        3 0.444
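
A minimal sketch of a group_modify call that produces this output is given below. The helper name test_function and its argument order come from the answer further down; its body is hypothetical, chosen only because it reproduces the ltv values shown (the mean of the country-wide and per-campaign revenue/users ratios). The test_tibble construction is likewise an assumption based on the printout.

```r
library(dplyr)
library(tibble)

# Assumed reconstruction of the example input.
test_tibble <- tibble(
  activation_date = rep(c(1, 2, 3), times = 3),
  country         = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  campaign        = rep(c(1, 2, 3), each = 3),
  revenue         = as.numeric(1:9),
  users           = as.numeric(11:19)
)

# Hypothetical stand-in for test_function: receives the columns of ONE
# country and returns one row per campaign. activation_date is accepted
# to match the call in the answer but is not used in this sketch.
test_function <- function(activation_date, campaign, revenue, users) {
  total_ltv <- sum(revenue) / sum(users)  # ratio over the whole country
  tibble(campaign = campaign, revenue = revenue, users = users) %>%
    group_by(campaign) %>%
    summarise(campaign_ltv = sum(revenue) / sum(users)) %>%
    mutate(ltv = (total_ltv + campaign_ltv) / 2) %>%
    select(campaign, ltv)
}

# group_modify() calls the function once per country; .x holds that
# country's rows without the grouping column.
result <- test_tibble %>%
  group_by(country) %>%
  group_modify(~ test_function(.x$activation_date, .x$campaign,
                               .x$revenue, .x$users)) %>%
  ungroup()

print(result)
```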

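1 Answer:

Answer 0: (score: 1)

Option 1 -

Using joins is a somewhat verbose but transparent way to do this. That said, it is not all that verbose when compared with the code in test_function. -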

# per-campaign ratio first, then join the per-country ratio back on
test_tibble %>% 
  group_by(country, campaign) %>% 
  summarise(campaign_ltv = sum(revenue)/sum(users)) %>% 
  inner_join(
    test_tibble %>% 
      group_by(country) %>% 
      summarise(total_ltv = sum(revenue)/sum(users)),
    by = "country"
  ) %>% 
  # ltv is the mean of the country-wide and campaign-level ratios
  mutate(ltv = (total_ltv + campaign_ltv)/2) %>% 
  ungroup()

# A tibble: 3 x 5
  country campaign campaign_ltv total_ltv   ltv
    <dbl>    <dbl>        <dbl>     <dbl> <dbl>
1       1        1        0.167     0.259 0.213
2       1        2        0.333     0.259 0.296
3       2        3        0.444     0.444 0.444
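
The numbers in this output can be checked by hand. The short sketch below redoes the arithmetic in base R, using the revenue (1 to 9) and users (11 to 19) values from the example tibble; the variable names are only illustrative.

```r
# Country 1 spans campaigns 1 and 2: revenue 1..6, users 11..16.
total_ltv_c1   <- sum(1:6) / sum(11:16)   # 21 / 81
campaign_ltv_1 <- sum(1:3) / sum(11:13)   # 6 / 36
campaign_ltv_2 <- sum(4:6) / sum(14:16)   # 15 / 45

ltv_1 <- (total_ltv_c1 + campaign_ltv_1) / 2
ltv_2 <- (total_ltv_c1 + campaign_ltv_2) / 2

# Country 2 has a single campaign, so both ratios coincide.
ltv_3 <- sum(7:9) / sum(17:19)            # 24 / 54

round(c(ltv_1, ltv_2, ltv_3), 3)
# [1] 0.213 0.296 0.444
```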

Option 2 -

Wrap the output of test_function in a list so that it is stored as a nested tibble, then use unnest -

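test_tibble %>%
  group_by(country) %>%
  mutate(
    # one nested tibble per country, repeated on every row of that country
    ltv = list(test_function(activation_date, campaign, revenue, users))
  ) %>%
  select(country, ltv) %>% 
  # keep a single copy of the nested tibble per country
  filter(row_number() == 1) %>% 
  # name the list column explicitly; a bare unnest() is deprecated in newer tidyr
  unnest(ltv) %>% 
  ungroup()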

# A tibble: 3 x 3
  country campaign   ltv
    <dbl>    <dbl> <dbl>
1       1        1 0.213
2       1        2 0.296
3       2        3 0.444

Option 3 -

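# df is the symbolic tibble from the question (R_1 ... R_9, U_1 ... U_9)
df %>% 
  group_by(country) %>% 
  # within each country, pair every campaign with every (revenue, users) row
  tidyr::complete(nesting(country, campaign), nesting(revenue, users)) %>% 
  # dplyr 1.0.0 and later spell this argument .add = TRUE
  group_by(campaign, add = TRUE)
  # now you have all revenue and users for each country-campaign
  # for total_ltv: use revenue and users as is
  # for campaign_ltv: use revenue and users where activation_date is not NA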

# A tibble: 15 x 5
# Groups:   country, campaign [3]
   country campaign revenue users activation_date
     <int>    <int> <chr>   <chr>           <int>
 1       1        1 R_1     U_1                 1
 2       1        1 R_2     U_2                 2
 3       1        1 R_3     U_3                 3
 4       1        1 R_4     U_4                NA
 5       1        1 R_5     U_5                NA
 6       1        1 R_6     U_6                NA
 7       1        2 R_1     U_1                NA
 8       1        2 R_2     U_2                NA
 9       1        2 R_3     U_3                NA
10       1        2 R_4     U_4                 1
11       1        2 R_5     U_5                 2
12       1        2 R_6     U_6                 3
13       2        3 R_7     U_7                 1
14       2        3 R_8     U_8                 2
15       2        3 R_9     U_9                 3

Demo with test_tibble -

test_tibble %>% 
  group_by(country) %>% 
  tidyr::complete(nesting(country, campaign), nesting(revenue, users)) %>% 
  group_by(campaign, add = TRUE) %>% 
  summarise(
    ltv = sum(revenue)/sum(users)/2 + 
      sum(revenue[!is.na(activation_date)])/sum(users[!is.na(activation_date)])/2
  ) %>% 
  ungroup()

# A tibble: 3 x 3
  country campaign   ltv
    <dbl>    <dbl> <dbl>
1       1        1 0.213
2       1        2 0.296
3       2        3 0.444
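
For more recent package versions, the demo may need two small adjustments. The sketch below assumes dplyr 1.0.0 or later, where group_by()'s add argument is spelled .add, and assumes that newer tidyr releases (1.2.0 onward, as far as I can tell) refuse to complete on a grouping column, so country is left out of the complete() call. The test_tibble construction is also an assumption based on the printout in the question.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Assumed reconstruction of the example input.
test_tibble <- tibble(
  activation_date = rep(c(1, 2, 3), times = 3),
  country         = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  campaign        = rep(c(1, 2, 3), each = 3),
  revenue         = as.numeric(1:9),
  users           = as.numeric(11:19)
)

result <- test_tibble %>%
  group_by(country) %>%
  # Within each country, pair every campaign with every (revenue, users)
  # row; pairs that did not exist get activation_date = NA.
  complete(campaign, nesting(revenue, users)) %>%
  group_by(campaign, .add = TRUE) %>%
  summarise(
    # country-wide ratio uses all rows; campaign ratio uses only the
    # rows that originally belonged to the campaign
    ltv = sum(revenue) / sum(users) / 2 +
      sum(revenue[!is.na(activation_date)]) /
        sum(users[!is.na(activation_date)]) / 2
  ) %>%
  ungroup()

print(result)
```

Completing on campaign alone still expands within each country because the data is grouped by country at that point, so the row set matches the 15-row intermediate result shown in Option 3.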