How to add an incremental rank based on a column value?

Time: 2018-05-29 19:08:58

Tags: r dataframe group-by data.table

I have a data frame in the following format:

sample_df <- structure(list(conversationid = c("C1",  "C2", "C2",  "C2", 
"C2",  "C2", "C3",  "C3", "C3",  "C3"), 
sentby = c("Consumer","Consumer", "Agent", "Agent", "Agent", "Consumer", 
"Agent", "Consumer","Agent", "Agent"), 
time = c("2018-04-25 03:54:04.550+0000", "2018-05-11 19:18:05.094+0000", 
     "2018-05-11 19:18:09.218+0000", "2018-05-11 19:18:09.467+0000", 
     "2018-05-11 19:18:13.527+0000", "2018-05-14 22:57:10.004+0000", 
     "2018-05-14 22:57:14.330+0000", "2018-05-14 22:57:20.795+0000", 
     "2018-05-14 22:57:22.168+0000", "2018-05-14 22:57:24.203+0000"),
diff = c(NA, NA, 0.0687333333333333, 0.00415, 0.0676666666666667, NA, 0.0721, 
0.10775, 0.0228833333333333,0.0339166666666667)), 
.Names = c("conversationid", "sentby","time","diff"), row.names = c(NA, 10L), 
class = "data.frame")

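Here conversationid is the conversation ID; a conversation can contain messages sent by either an agent or a consumer. What I want is a running count that increments each time an "Agent" message appears within a conversation, like this:

Target output: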

conversationid  sentby  diff    agent_counter_flag
        C1     Consumer NA          0
        C2     Consumer NA          0
        C2     Agent    0.06873333  1
        C2     Agent    0.00415     2
        C2     Agent    0.06766667  3
        C2     Consumer NA          0
        C3     Agent    0.0721      1
        C3     Consumer 0.10775     0
        C3     Agent    0.02288333  2
        C3     Agent    0.03391667  3

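Currently, I am able to partition the data frame and rank all records grouped by conversation ID with the following code:

library(data.table)
setDT(sample_df)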
sample_df[,Order := rank(time, ties.method = "first"), by = "conversationid"]
sample_df <- as.data.frame(sample_df)

But all this does is rank the records within each partition, regardless of whether the sender is an "Agent" or a "Consumer".

Current output:

   conversationid   sentby  diff    Order
        C1     Consumer NA          1
        C2     Consumer NA          1
        C2     Agent    0.06873333  2
        C2     Agent    0.00415     3
        C2     Agent    0.06766667  4
        C2     Consumer NA          5
        C3     Agent    0.0721      1
        C3     Consumer 0.10775     2
        C3     Agent    0.02288333  3
        C3     Agent    0.03391667  4

How should I proceed so that my data frame looks like the target output? Thanks!

4 answers:

Answer 0: (score: 2)

library(data.table)
setDT(sample_df)

sample_df[, agent_counter_flag := {sba = (sentby == 'Agent'); sba*cumsum(sba)}
          , by = conversationid]
sample_df

#     conversationid   sentby                         time       diff agent_counter_flag
#  1:             C1 Consumer 2018-04-25 03:54:04.550+0000         NA                  0
#  2:             C2 Consumer 2018-05-11 19:18:05.094+0000         NA                  0
#  3:             C2    Agent 2018-05-11 19:18:09.218+0000 0.06873333                  1
#  4:             C2    Agent 2018-05-11 19:18:09.467+0000 0.00415000                  2
#  5:             C2    Agent 2018-05-11 19:18:13.527+0000 0.06766667                  3
#  6:             C2 Consumer 2018-05-14 22:57:10.004+0000         NA                  0
#  7:             C3    Agent 2018-05-14 22:57:14.330+0000 0.07210000                  1
#  8:             C3 Consumer 2018-05-14 22:57:20.795+0000 0.10775000                  0
#  9:             C3    Agent 2018-05-14 22:57:22.168+0000 0.02288333                  2
# 10:             C3    Agent 2018-05-14 22:57:24.203+0000 0.03391667                  3
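To see why the expression inside `j` produces this column, here is a minimal base R sketch of the same arithmetic on a single conversation. The vector `sentby` below is made-up illustration data, not the question's data frame.

```r
# Hypothetical messages from one conversation (illustration only)
sentby <- c("Agent", "Consumer", "Agent", "Agent")

# Logical mask: TRUE where the sender is an agent
sba <- sentby == "Agent"   # TRUE FALSE TRUE TRUE

# Running count of agent messages so far (TRUE counts as 1)
running <- cumsum(sba)     # 1 1 2 3

# Multiplying by the mask zeroes out the consumer rows
flag <- sba * running      # 1 0 2 3
print(flag)
```

Running this per conversation, which is what `by = conversationid` does, restarts the count at every new conversation ID.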

As @Frank pointed out, this also works:

sample_df[, agent_counter_flag := rowid(conversationid, sentby)*(sentby == "Agent")]
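As a small sketch of what this one-liner relies on: `rowid()` numbers the rows within each combination of the vectors it is given, so no `by` is needed. The vectors below are made-up illustration data.

```r
library(data.table)

# Hypothetical messages from one conversation (illustration only)
conversationid <- c("C3", "C3", "C3", "C3")
sentby         <- c("Agent", "Consumer", "Agent", "Agent")

# Occurrence number within each (conversationid, sentby) combination
ids <- rowid(conversationid, sentby)   # 1 1 2 3

# The consumer row also gets an id, so mask it out with the logical test
flag <- ids * (sentby == "Agent")      # 1 0 2 3
print(flag)
```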

Benchmark

library(magrittr)        # provides %>%
library(microbenchmark)
sample_df <- replicate(1000, sample_df, simplify = FALSE) %>% rbindlist
microbenchmark(
  rowidFrank = sample_df[, agent_counter_flag := 
                           rowid(conversationid, sentby)*(sentby == "Agent")]
, rowidUwe = sample_df[sentby == "Agent", agent_counter_flag := rowid(conversationid)]
, cumsum   = sample_df[, agent_counter_flag := {sba = (sentby == 'Agent'); sba*cumsum(sba)}
                       , by = conversationid]
, unit = 'relative')

# Unit: relative
# expr            min       lq     mean   median       uq       max neval
# rowidFrank 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
# rowidUwe   1.448858 1.438742 1.410849 1.414428 1.535292 0.5549433   100
# cumsum     1.322493 1.306228 1.316188 1.261325 1.308371 1.6431036   100

Answer 1: (score: 1)

Here is my data.table solution, which uses the rowid() function and creates the new column agent_counter_flag by reference:

library(data.table)
setDT(sample_df)
sample_df[sentby == "Agent", agent_counter_flag := rowid(conversationid)][]
    conversationid   sentby                         time       diff agent_counter_flag
 1:             C1 Consumer 2018-04-25 03:54:04.550+0000         NA                 NA
 2:             C2 Consumer 2018-05-11 19:18:05.094+0000         NA                 NA
 3:             C2    Agent 2018-05-11 19:18:09.218+0000 0.06873333                  1
 4:             C2    Agent 2018-05-11 19:18:09.467+0000 0.00415000                  2
 5:             C2    Agent 2018-05-11 19:18:13.527+0000 0.06766667                  3
 6:             C2 Consumer 2018-05-14 22:57:10.004+0000         NA                 NA
 7:             C3    Agent 2018-05-14 22:57:14.330+0000 0.07210000                  1
 8:             C3 Consumer 2018-05-14 22:57:20.795+0000 0.10775000                 NA
 9:             C3    Agent 2018-05-14 22:57:22.168+0000 0.02288333                  2
10:             C3    Agent 2018-05-14 22:57:24.203+0000 0.03391667                  3
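Note that this output leaves NA on the consumer rows, whereas the target output in the question has 0 there. One possible follow-up, sketched on a small made-up table (`dt` is hypothetical, not the question's data), is to overwrite the NAs by reference:

```r
library(data.table)

# Small hypothetical table (illustration only)
dt <- data.table(
  conversationid = c("C1", "C2", "C2", "C3", "C3"),
  sentby         = c("Consumer", "Consumer", "Agent", "Agent", "Consumer")
)

# Number the agent rows within each conversation; other rows stay NA
dt[sentby == "Agent", agent_counter_flag := rowid(conversationid)]

# Replace the remaining NAs with 0 to match the target output
dt[is.na(agent_counter_flag), agent_counter_flag := 0L]
print(dt)
```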

Answer 2: (score: 0)

Here you go:

library(dplyr)

df <- data.frame(cid = c(rep("c1", 6), rep("C2", 4)),
                 Sent_by = c("Consumer", "Agent", "Consumer", "Consumer", "Agent", "Agent",
                             "Consumer", "Agent", "Agent", "Consumer"))
df %>% group_by(cid, Sent_by) %>%
  mutate(agent_flag = ifelse(Sent_by == "Agent", 1:n(), NA),
         consumer_flag = ifelse(Sent_by == "Consumer", 1:n(), NA))

Answer 3: (score: 0)

Going off this post on a similar problem, you can use dplyr's grouping to sum the logical values obtained by testing sentby == "Agent".

The long way, just to make clear what the logical column looks like:

library(dplyr)

sample_df %>%
  mutate(is_agent = sentby == "Agent") %>%
  group_by(conversationid) %>%
  mutate(agent_counter_flag = ifelse(is_agent, cumsum(is_agent), 0)) %>%
  ungroup()
#> # A tibble: 10 x 6
#>    conversationid sentby  time               diff is_agent agent_counter_f…
#>    <chr>          <chr>   <chr>             <dbl> <lgl>               <dbl>
#>  1 C1             Consum… 2018-04-25 03… NA       FALSE                   0
#>  2 C2             Consum… 2018-05-11 19… NA       FALSE                   0
#>  3 C2             Agent   2018-05-11 19…  0.0687  TRUE                    1
#>  4 C2             Agent   2018-05-11 19…  0.00415 TRUE                    2
#>  5 C2             Agent   2018-05-11 19…  0.0677  TRUE                    3
#>  6 C2             Consum… 2018-05-14 22… NA       FALSE                   0
#>  7 C3             Agent   2018-05-14 22…  0.0721  TRUE                    1
#>  8 C3             Consum… 2018-05-14 22…  0.108   FALSE                   0
#>  9 C3             Agent   2018-05-14 22…  0.0229  TRUE                    2
#> 10 C3             Agent   2018-05-14 22…  0.0339  TRUE                    3

You might want to follow this with select(-is_agent) to drop the logical column.

Or, for a shorter form, you can call cumsum directly inside mutate:

sample_df %>%
  group_by(conversationid) %>%
  mutate(agent_counter_flag = ifelse(sentby == "Agent", cumsum(sentby == "Agent"), 0)) %>%
  ungroup()

Either way, the idea is, within each conversationid, to use the running count of agent messages if the row was sent by an agent, or to set it to 0 if it was not.