I have a data frame in the following format:
sample_df <- structure(list(conversationid = c("C1", "C2", "C2", "C2",
"C2", "C2", "C3", "C3", "C3", "C3"),
sentby = c("Consumer","Consumer", "Agent", "Agent", "Agent", "Consumer",
"Agent", "Consumer","Agent", "Agent"),
time = c("2018-04-25 03:54:04.550+0000", "2018-05-11 19:18:05.094+0000",
"2018-05-11 19:18:09.218+0000", "2018-05-11 19:18:09.467+0000",
"2018-05-11 19:18:13.527+0000", "2018-05-14 22:57:10.004+0000",
"2018-05-14 22:57:14.330+0000", "2018-05-14 22:57:20.795+0000",
"2018-05-14 22:57:22.168+0000", "2018-05-14 22:57:24.203+0000"),
diff = c(NA, NA, 0.0687333333333333, 0.00415, 0.0676666666666667, NA, 0.0721,
0.10775, 0.0228833333333333,0.0339166666666667)),
.Names = c("conversationid", "sentby","time","diff"), row.names = c(NA, 10L),
class = "data.frame")
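Here conversationid is the conversation ID, and a conversation can contain messages sent by either an agent or a consumer. What I want is a running count that increments each time "Agent" appears within a conversation, and is 0 for consumer rows, like this:
Target output: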
conversationid sentby diff agent_counter_flag
C1 Consumer NA 0
C2 Consumer NA 0
C2 Agent 0.06873333 1
C2 Agent 0.00415 2
C2 Agent 0.06766667 3
C2 Consumer NA 0
C3 Agent 0.0721 1
C3 Consumer 0.10775 0
C3 Agent 0.02288333 2
C3 Agent 0.03391667 3
Currently, I can partition the data frame and rank all records grouped by conversationid using the following code:
library(data.table)
setDT(sample_df)
sample_df[,Order := rank(time, ties.method = "first"), by = "conversationid"]
sample_df <- as.data.frame(sample_df)
But all this does is rank the records within each partition, regardless of whether the message came from an "Agent" or a "Consumer".
Current output:
conversationid sentby diff Order
C1 Consumer NA 1
C2 Consumer NA 1
C2 Agent 0.06873333 2
C2 Agent 0.00415 3
C2 Agent 0.06766667 4
C2 Consumer NA 5
C3 Agent 0.0721 1
C3 Consumer 0.10775 2
C3 Agent 0.02288333 3
C3 Agent 0.03391667 4
How should I proceed so that my data frame matches the target output? Thanks!
Answer 0 (score: 2)
library(data.table)
setDT(sample_df)
sample_df[, agent_counter_flag := {sba = (sentby == 'Agent'); sba*cumsum(sba)}
, by = conversationid]
sample_df
# conversationid sentby time diff agent_counter_flag
# 1: C1 Consumer 2018-04-25 03:54:04.550+0000 NA 0
# 2: C2 Consumer 2018-05-11 19:18:05.094+0000 NA 0
# 3: C2 Agent 2018-05-11 19:18:09.218+0000 0.06873333 1
# 4: C2 Agent 2018-05-11 19:18:09.467+0000 0.00415000 2
# 5: C2 Agent 2018-05-11 19:18:13.527+0000 0.06766667 3
# 6: C2 Consumer 2018-05-14 22:57:10.004+0000 NA 0
# 7: C3 Agent 2018-05-14 22:57:14.330+0000 0.07210000 1
# 8: C3 Consumer 2018-05-14 22:57:20.795+0000 0.10775000 0
# 9: C3 Agent 2018-05-14 22:57:22.168+0000 0.02288333 2
# 10: C3 Agent 2018-05-14 22:57:24.203+0000 0.03391667 3
As @Frank pointed out, this also works:
sample_df[, agent_counter_flag := rowid(conversationid, sentby)*(sentby == "Agent")]
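To see why the rowid() one-liner gives the same result, here is a minimal sketch on a small made-up table (dt and the within_pair column are illustrative names, not part of the original answer). rowid() numbers rows within each (conversationid, sentby) combination, and multiplying by the logical test turns the Consumer rows into 0.

```r
library(data.table)

# Toy data (hypothetical, smaller than the question's sample_df)
dt <- data.table(
  conversationid = c("C2", "C2", "C2", "C3", "C3", "C3"),
  sentby         = c("Consumer", "Agent", "Agent", "Agent", "Consumer", "Agent")
)

# Step 1: number the rows within each (conversationid, sentby) combination
dt[, within_pair := rowid(conversationid, sentby)]

# Step 2: TRUE/FALSE act as 1/0, so Consumer rows are zeroed out
dt[, agent_counter_flag := within_pair * (sentby == "Agent")]

print(dt)
```

The intermediate column is only there to show the two steps; the one-liner above does both at once.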
Benchmark
library(microbenchmark)
library(magrittr)  # provides %>%
sample_df <- replicate(1000, sample_df, simplify = FALSE) %>% rbindlist
microbenchmark(
rowidFrank = sample_df[, agent_counter_flag :=
rowid(conversationid, sentby)*(sentby == "Agent")]
, rowidUwe = sample_df[sentby == "Agent", agent_counter_flag := rowid(conversationid)]
, cumsum = sample_df[, agent_counter_flag := {sba = (sentby == 'Agent'); sba*cumsum(sba)}
, by = conversationid]
, unit = 'relative')
# Unit: relative
# expr min lq mean median uq max neval
# rowidFrank 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
# rowidUwe 1.448858 1.438742 1.410849 1.414428 1.535292 0.5549433 100
# cumsum 1.322493 1.306228 1.316188 1.261325 1.308371 1.6431036 100
Answer 1 (score: 1)
Here is my data.table solution, which uses the rowid() function and creates the new column agent_counter_flag by reference:
library(data.table)
setDT(sample_df)
sample_df[sentby == "Agent", agent_counter_flag := rowid(conversationid)][]
    conversationid   sentby                         time       diff agent_counter_flag
 1:             C1 Consumer 2018-04-25 03:54:04.550+0000         NA                 NA
 2:             C2 Consumer 2018-05-11 19:18:05.094+0000         NA                 NA
 3:             C2    Agent 2018-05-11 19:18:09.218+0000 0.06873333                  1
 4:             C2    Agent 2018-05-11 19:18:09.467+0000 0.00415000                  2
 5:             C2    Agent 2018-05-11 19:18:13.527+0000 0.06766667                  3
 6:             C2 Consumer 2018-05-14 22:57:10.004+0000         NA                 NA
 7:             C3    Agent 2018-05-14 22:57:14.330+0000 0.07210000                  1
 8:             C3 Consumer 2018-05-14 22:57:20.795+0000 0.10775000                 NA
 9:             C3    Agent 2018-05-14 22:57:22.168+0000 0.02288333                  2
10:             C3    Agent 2018-05-14 22:57:24.203+0000 0.03391667                  3
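Note that this leaves NA, not 0, on the Consumer rows, whereas the target output in the question shows 0. If zeros are wanted, one possible follow-up (a sketch on a small made-up table named dt, not part of the original answer) is to fill the NA entries by reference afterwards:

```r
library(data.table)

# Toy data (hypothetical, smaller than the question's sample_df)
dt <- data.table(
  conversationid = c("C1", "C2", "C2", "C3"),
  sentby         = c("Consumer", "Consumer", "Agent", "Agent")
)

# Same idea as above: only Agent rows are numbered, the rest stay NA
dt[sentby == "Agent", agent_counter_flag := rowid(conversationid)]

# Replace the remaining NA values with 0, again by reference
dt[is.na(agent_counter_flag), agent_counter_flag := 0L]

print(dt)
```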
Answer 2 (score: 0)
Here you go:
library(dplyr)
df <- data.frame(cid = c(rep("c1", 6), rep("C2", 4)),
Sent_by = c("Consumer", "Agent", "Consumer", "Consumer", "Agent", "Agent",
"Consumer", "Agent", "Agent", "Consumer"))
df %>% group_by(cid, Sent_by) %>%
mutate(agent_flag = ifelse(Sent_by == "Agent", 1:n(), NA),
consumer_flag = ifelse(Sent_by == "Consumer", 1:n(), NA))
Answer 3 (score: 0)
Working from a post on a similar problem: you can use dplyr's grouping to take a cumulative sum of the logical values produced by testing sentby == "Agent".
The long way, just to make clear what the logical column looks like:
library(dplyr)
sample_df %>%
mutate(is_agent = sentby == "Agent") %>%
group_by(conversationid) %>%
mutate(agent_counter_flag = ifelse(is_agent, cumsum(is_agent), 0)) %>%
ungroup()
#> # A tibble: 10 x 6
#> conversationid sentby time diff is_agent agent_counter_f…
#> <chr> <chr> <chr> <dbl> <lgl> <dbl>
#> 1 C1 Consum… 2018-04-25 03… NA FALSE 0
#> 2 C2 Consum… 2018-05-11 19… NA FALSE 0
#> 3 C2 Agent 2018-05-11 19… 0.0687 TRUE 1
#> 4 C2 Agent 2018-05-11 19… 0.00415 TRUE 2
#> 5 C2 Agent 2018-05-11 19… 0.0677 TRUE 3
#> 6 C2 Consum… 2018-05-14 22… NA FALSE 0
#> 7 C3 Agent 2018-05-14 22… 0.0721 TRUE 1
#> 8 C3 Consum… 2018-05-14 22… 0.108 FALSE 0
#> 9 C3 Agent 2018-05-14 22… 0.0229 TRUE 2
#> 10 C3 Agent 2018-05-14 22… 0.0339 TRUE 3
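You may want to follow this with select(-is_agent) to remove the logical column.
Or, for a shorter form, you can call cumsum directly inside mutate:
sample_df %>%
group_by(conversationid) %>%
mutate(agent_counter_flag = ifelse(sentby == "Agent", cumsum(sentby == "Agent"), 0)) %>%
ungroup()
Either way, the idea is that within each conversationid, a message gets the running count of Agent messages so far if it was sent by an agent, or 0 if it was not.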
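For comparison, the same grouped cumulative sum can be sketched in base R with ave(). This is a different tool from the dplyr code above, shown only to illustrate the idea; df and is_agent are illustrative names on a small made-up data frame.

```r
# Toy data (hypothetical, smaller than the question's sample_df)
df <- data.frame(
  conversationid = c("C2", "C2", "C2", "C3", "C3", "C3"),
  sentby         = c("Consumer", "Agent", "Agent", "Agent", "Consumer", "Agent"),
  stringsAsFactors = FALSE
)

# 1 for Agent rows, 0 otherwise
is_agent <- as.integer(df$sentby == "Agent")

# Cumulative sum of is_agent within each conversation,
# then multiply by is_agent so non-agent rows become 0
df$agent_counter_flag <- ave(is_agent, df$conversationid, FUN = cumsum) * is_agent

print(df)
```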