Sum one column up to a value in another column, then compute each row's percentage of that sum

Time: 2018-05-30 17:19:06

Tags: r grep dplyr mutate

I have a dataset that looks like this:

User<- c("User1", "User1","User1", "User1","User1", "User1","User1", "User2","User2","User2","User2","User2","User2","User2")
Touchpoints <- c("A", "B", "C", "F", "D", "E", "H","A", "B", "K", "D", "E", "F", "M")
Conversion <- c(0,0,0,1,0,0,1,0,0,1,1,0,0,1)
Frequency<-c(1,2,3,0,4,5,0,1,2,0,0,3,4,5)
df<-data.frame(User, Touchpoints, Conversion, Frequency)
df$Exponential<-ifelse(df$Frequency>0, exp(df$Frequency), 0)

df
    User Touchpoints Conversion Frequency Exponential
1  User1           A          0         1    2.718282
2  User1           B          0         2    7.389056
3  User1           C          0         3   20.085537
4  User1           F          1         0    0.000000
5  User1           D          0         4   54.598150
6  User1           E          0         5  148.413159
7  User1           H          1         0    0.000000
8  User2           A          0         1    2.718282
9  User2           B          0         2    7.389056
10 User2           K          1         0    0.000000
11 User2           D          1         0    0.000000
12 User2           E          0         3   20.085537
13 User2           F          0         4   54.598150
14 User2           M          1         5  148.413159

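Here is what I want to do:

For each User, I want to calculate the _Conv values: the percentage that each row's Exponential represents of the sum of the Exponential column up to each Conversion. Here is an example: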

    User Touchpoints Conversion Frequency Exponential   Sum of Exp    1st_Conv   Sum_Exp_for_Conv2     2nd_Conv
1  User1           A          0         1    2.718282       30.192      0.0900             233.204       0.0116
2  User1           B          0         2    7.389056       30.192      0.2447             233.204       0.0317
3  User1           C          0         3   20.085537       30.192      0.6652             233.204       0.0861
4  User1           F          1         0    0.000000            0      0.0000             233.204            0     
5  User1           D          0         4   54.598150            0      0.0000             233.204       0.2341
6  User1           E          0         5  148.413159            0      0.0000             233.204       0.6364
7  User1           H          1         0    0.000000            0      0.0000                   0            0
8  User2           A          0         1    2.718282       10.107      0.2689              10.107       0.2689
9  User2           B          0         2    7.389056       10.107      0.7311              10.107       0.7311
10 User2           K          1         0    0.000000            0      0.0000                   0            0
11 User2           D          1         0    0.000000            0      0.0000                   0            0
12 User2           E          0         3   20.085537            0      0.0000                   0            0
13 User2           F          0         4   54.598150            0      0.0000                   0            0
14 User2           M          0         5  148.413159            0      0.0000                   0            0
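As a quick check of the arithmetic behind the 1st_Conv and 2nd_Conv columns for User1, here is a minimal base R sketch. It assumes, as the table suggests, that each conversion's denominator is the sum of Exponential over the rows before that conversion; the variable names are illustrative only.

```r
# Exponential values for User1 in row order (rows 1-7 of the data above)
exponential <- c(exp(1), exp(2), exp(3), 0, exp(4), exp(5), 0)

# Denominator for the first conversion (row 4): sum of rows 1-3
sum_conv1 <- sum(exponential[1:3])
# Denominator for the second conversion (row 7): sum of rows 1-6
sum_conv2 <- sum(exponential[1:6])

# Rows at or after a conversion contribute nothing to that conversion
first_conv  <- c(exponential[1:3] / sum_conv1, rep(0, 4))
second_conv <- c(exponential[1:6] / sum_conv2, 0)

round(cbind(first_conv, second_conv), 4)
```

The two columns should agree with 1st_Conv and 2nd_Conv for User1 to within rounding in the last digit.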

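In some cases a user will have more than 100 conversions, and creating thousands of columns this way does not seem to scale.

My final output should add up all the _Conv columns into a single last column called Final_Conv. For this example, the final output would look like this: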

        User Touchpoints Conversion Frequency   Final_Conv
    1  User1           A          0         1       0.1017      
    2  User1           B          0         2       0.2764
    3  User1           C          0         3       0.7514
    4  User1           F          1         0            0
    5  User1           D          0         4       0.2341
    6  User1           E          0         5       0.6364
    7  User1           H          1         0            0
    8  User2           A          0         1       0.5379
    9  User2           B          0         2       1.4621
    10 User2           K          1         0            0
    11 User2           D          1         0            0
    12 User2           E          0         3            0
    13 User2           F          0         4            0
    14 User2           M          0         5            0

Any help would be great, thanks!

1 answer:

Answer 0 (score: 1)

This may not be the simplest code, but we can do the following:

library(dplyr)
library(tidyr)

df %>%
  group_by(User) %>%
  mutate(row_id = row_number(),
         conv_id = cumsum(Conversion),
         exp_cumsum = cumsum(Exponential)) %>%
  group_by(conv_id, add = TRUE) %>%
  mutate(sum_of_exp = ifelse(n()==1, NA, last(exp_cumsum))) %>%
  spread(conv_id, sum_of_exp, sep = "_") %>%
  arrange(User, row_id) %>%
  fill(!!!vars(starts_with("conv_id")), .direction = "up") %>%
  mutate_at(vars(starts_with("conv_id")), funs(Exponential/.)) %>%
  ungroup() %>%
  mutate(Final_Conv = rowSums(.[-(1:7)], na.rm = TRUE)) %>%
  select(1:4, Final_Conv)

Notes:

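I first created the cumulative sums of Conversion and Exponential, added conv_id as an additional grouping variable, and replaced all values within each User + conv_id combination with the last value of exp_cumsum. Then I spread the conv_id and sum_of_exp columns and filled each conv_id_ column upward. Finally, I used mutate_at to divide Exponential by each conv_id_ column, and created Final_Conv by summing all the resulting conv_id_ columns with rowSums.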
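To make the first steps concrete, the following base R sketch rebuilds the intermediate conv_id, exp_cumsum and sum_of_exp values outside the pipeline. It is only an illustration of the grouping logic described above, not a replacement for the dplyr code; `block` and `last_or_na` are names introduced here for the example.

```r
# Same data as in the question
User <- c(rep("User1", 7), rep("User2", 7))
Conversion <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1)
Frequency <- c(1, 2, 3, 0, 4, 5, 0, 1, 2, 0, 0, 3, 4, 5)
Exponential <- ifelse(Frequency > 0, exp(Frequency), 0)

# Per-user running count of conversions and running total of Exponential
conv_id <- ave(Conversion, User, FUN = cumsum)
exp_cumsum <- ave(Exponential, User, FUN = cumsum)

# One block per User + conv_id combination
block <- interaction(User, conv_id, drop = TRUE)

# Last running total in each block; single-row blocks become NA,
# mirroring ifelse(n() == 1, NA, last(exp_cumsum)) in the pipeline
last_or_na <- function(x) if (length(x) == 1) NA_real_ else x[length(x)]
sum_of_exp <- ave(exp_cumsum, block, FUN = last_or_na)

data.frame(User, conv_id, exp_cumsum, sum_of_exp)
```

For User1 this yields one denominator for rows 1 to 3 and another for rows 4 to 6, while the final single-row block is NA; these are the values that spread() then moves into the conv_id_ columns.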

This solution works for any number of Conversion values for each User.
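If the number of conversions per user is large, the wide conv_id_ columns can be avoided altogether. The sketch below is an alternative, not the pipeline above: it assumes the definition used in the question (each conversion's denominator is the running total of Exponential before that conversion row) and computes Final_Conv with two cumulative sums. `final_conv` is a hypothetical helper name.

```r
# Final_Conv for a single user's rows, given in their original order
final_conv <- function(exponential, conversion) {
  # Running total of Exponential strictly before each row
  total_before <- cumsum(exponential) - exponential
  # 1 / denominator on conversion rows, 0 elsewhere (also 0 if nothing precedes)
  inv_denominator <- ifelse(conversion == 1 & total_before > 0,
                            1 / total_before, 0)
  # Sum of those reciprocals over all conversions strictly after each row
  inv_after <- rev(cumsum(rev(inv_denominator))) - inv_denominator
  exponential * inv_after
}

# User1 from the question
exponential <- c(exp(1), exp(2), exp(3), 0, exp(4), exp(5), 0)
conversion <- c(0, 0, 0, 1, 0, 0, 1)
user1_final <- round(final_conv(exponential, conversion), 4)
user1_final
```

For User1 this reproduces the Final_Conv column shown in the question. To apply it per user, it could be called inside a grouped mutate, for example `df %>% group_by(User) %>% mutate(Final_Conv = final_conv(Exponential, Conversion))`.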

Result:

User