Performing a sum when multiple conditions come from another table?

Time: 2018-12-28 10:14:25

Tags: r

I am trying to find the total number of users in df1 that meet the conditions specified in df2, but I keep getting error messages.

df1 looks like this:

    id  step1          step2
    1   session_start  NA
    2   session_start  NA
    3   session_start  sign_up
    4   session_start  sign_up
    5   session_start  sign_up
    6   sign_up        session_start

df1 <- Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':  6 obs. of  3 variables:
    $ id   : chr  "1" "2" "3" "4" ...
    $ step1: chr  "session_start" "session_start" "session_start" "session_start" ...
    $ step2: chr  NA NA "sign_up" "sign_up" ...

df2 looks like this:

    step1          step2         count
    session_start  sign_up       0
    sign_up        in_screen     0
    in_screen      click_banner  0
    session_stop   session_stop  0

df2 <- structure(c("session_start", "sign_up", "0", "sign_up", "in_screen", "0", "in_screen", "click_banner", "0", "session_stop", "session_stop", "0"), .Dim = c(3L, 4L), .Dimnames = list(c("step1", "step2", "count"), NULL))

In the count column, I want to show how many users (in total) completed df2$step1 and then df2$step2, in that order. In the example above, the first row of df2$count should be 3, because 3 users in df1 have session_start as df1$step1 and sign_up as df1$step2.

Previous attempts

When I do this manually with the following code, everything works:

count <- sum(df1$step1 == "session_start" & df1$step2 == "sign_up", na.rm = TRUE)

However, when I replace "session_start" and "sign_up" with dynamic values, I get the error "test8$step1: $ operator is invalid for atomic vectors":

df2$count <- sum(df1$step1 == df2$step1 & df1$step2 == df2$step2, na.rm = TRUE)

I also tried replacing "$" with "[]", but I still get an error: "Error: Column `session_start`, `sign_up`, `in_screen`, … not found".

Desired result:

I would like to be able to add this extra count column to df2. Can you help? If so, thank you very much!

2 answers:

Answer 0 (score: 3)

You can use mapply to count, for each row of df2, the rows of df1 whose step1 and step2 both match that row's values.

df2$count <- mapply(function(x, y) 
    sum(df1$step1 == x & df1$step2 == y, na.rm = TRUE), df2$step1, df2$step2)


df2
#          step1        step2 count
#1 session_start      sign_up     3
#2       sign_up    in_screen     0
#3     in_screen click_banner     0
#4  session_stop session_stop     0
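
To make the difference from the question's attempt concrete, here is a minimal sketch, assuming the same df1 and df2 as in the Data section of this answer. Comparing df1$step1 with df2$step1 directly lines the two columns up position by position (recycling the shorter one), and sum() then collapses everything to one number. Pairing the values row by row gives one count per row of df2. The helper name count_pair is made up for this illustration, and vapply is used here only as a type-checked alternative to mapply.

```r
df1 <- data.frame(
  id    = as.character(1:6),
  step1 = c(rep("session_start", 5), "sign_up"),
  step2 = c(NA, NA, "sign_up", "sign_up", "sign_up", "session_start"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  step1 = c("session_start", "sign_up", "in_screen", "session_stop"),
  step2 = c("sign_up", "in_screen", "click_banner", "session_stop"),
  stringsAsFactors = FALSE
)

# Direct comparison: df2's 4 values are recycled against df1's 6 rows,
# position by position (R warns about the length mismatch), and sum()
# returns a single number rather than one count per row of df2.
naive <- suppressWarnings(
  sum(df1$step1 == df2$step1 & df1$step2 == df2$step2, na.rm = TRUE)
)
length(naive)  # 1

# Hypothetical helper: count the rows of df1 matching one (x, y) pair.
count_pair <- function(x, y) {
  sum(df1$step1 == x & df1$step2 == y, na.rm = TRUE)
}

# One call per row of df2; integer(1) checks that each result is one integer.
counts <- vapply(
  seq_len(nrow(df2)),
  function(i) count_pair(df2$step1[i], df2$step2[i]),
  integer(1)
)
counts  # 3 0 0 0
```
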

Data

df1 <- structure(list(id = c("1", "2", "3", "4", "5", "6"), 
step1 = c("session_start", "session_start", "session_start", 
 "session_start", "session_start", 
 "sign_up"), step2 = c(NA, NA, "sign_up", "sign_up", "sign_up", 
"session_start")), .Names = c("id", "step1", "step2"), row.names = c(NA, 
-6L), class = "data.frame")

df2 <- structure(list(step1 = c("session_start", "sign_up", "in_screen", 
"session_stop"), step2 = c("sign_up", "in_screen", "click_banner", 
"session_stop")), .Names = c("step1", "step2"), row.names = c(NA, 
-4L), class = "data.frame")
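
A note on the "$ operator is invalid for atomic vectors" error: the dput output of df2 in the question carries .Dim and .Dimnames attributes, which means the object is a character matrix and not a data frame, and that is the likely cause of the error. The sketch below rebuilds that matrix (assuming the closing parenthesis of c() belongs right after the last "0") and converts it into a data frame like the one defined above. The names m and df2_fixed are chosen only for this illustration.

```r
# Matrix as given by the question's dput output:
# 3 rows (step1, step2, count) by 4 columns, one column per step pair.
m <- structure(
  c("session_start", "sign_up", "0",
    "sign_up", "in_screen", "0",
    "in_screen", "click_banner", "0",
    "session_stop", "session_stop", "0"),
  .Dim = c(3L, 4L),
  .Dimnames = list(c("step1", "step2", "count"), NULL)
)

is.data.frame(m)  # FALSE: a matrix is an atomic vector with dimensions

# `$` on an atomic vector raises an error; tryCatch captures its message.
msg <- tryCatch(m$step1, error = function(e) conditionMessage(e))

# Transpose so that each step pair becomes a row, then convert.
df2_fixed <- as.data.frame(t(m), stringsAsFactors = FALSE)
df2_fixed$count <- as.integer(df2_fixed$count)

df2_fixed$step1  # `$` now works because df2_fixed is a data frame
```
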

Answer 1 (score: 3)

Here is a tidyverse solution.

library(tidyverse)

df2 %>%
  group_by(step1, step2) %>%
  mutate(count = sum(step1 == df1$step1 & step2 == df1$step2, na.rm = TRUE))
## A tibble: 4 x 3
## Groups:   step1, step2 [4]
#  step1         step2        count
#  <chr>         <chr>        <int>
#1 session_start sign_up          3
#2 sign_up       in_screen        0
#3 in_screen     click_banner     0
#4 session_stop  session_stop     0

Note that you can also use summarise instead of mutate, but the rows of the output will then come out in a different order.
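
To illustrate the remark about row order, here is a minimal sketch of the summarise variant. It assumes dplyr 1.0.0 or later (for the .groups argument) and the df1 and df2 data from the first answer. A grouped summarise returns its rows sorted by the grouping columns, so they no longer follow the original order of df2.

```r
library(dplyr)

df1 <- data.frame(
  id    = as.character(1:6),
  step1 = c(rep("session_start", 5), "sign_up"),
  step2 = c(NA, NA, "sign_up", "sign_up", "sign_up", "session_start"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  step1 = c("session_start", "sign_up", "in_screen", "session_stop"),
  step2 = c("sign_up", "in_screen", "click_banner", "session_stop"),
  stringsAsFactors = FALSE
)

# Each group holds exactly one (step1, step2) pair, which is compared
# against every row of df1; .groups = "drop" returns an ungrouped tibble.
out <- df2 %>%
  group_by(step1, step2) %>%
  summarise(
    count = sum(step1 == df1$step1 & step2 == df1$step2, na.rm = TRUE),
    .groups = "drop"
  )

out  # rows sorted by step1: in_screen, session_start, session_stop, sign_up
```

If the original order of df2 matters, the mutate version shown above keeps it.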