I have this data frame (df):
structure(list(from = c("(192) 242-2345", NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "(832) 345-3168",
NA, NA), to = c("(900) 301-3451", NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "(900) 234-1231",
NA, NA), action_result = c("Voicemail", "No Answer", "No Answer",
"No Answer", "No Answer", "No Answer", "No Answer", "No Answer",
"No Answer", "IP Phone Offline", "No Answer", "No Answer", "Busy",
"Busy", "No Answer", "No Answer", "No Answer", "No Answer", "No Answer",
"No Answer", "No Answer", "Busy", "IP Phone Offline", "Busy",
"No Answer", "No Answer", "No Answer", "No Answer", "No Answer",
"IP Phone Offline", "IP Phone Offline", "No Answer", "No Answer",
"IP Phone Offline", "No Answer", "No Answer", "Busy", "Missed",
"Hang Up", "Hang Up")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -40L))
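The first row contains the phone number that placed the call and the number that received it. The rows after it are all NA in those columns. So rows 1-37 should be treated as one group, and rows 38 to 40 as a second group. I want to check whether each group contains the value Call Connected in the action_result column.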
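I tried group_by on the from and to values, but the full dataset I am working with has repeated from and to pairs, so that did not work. I want a dplyr solution that checks whether the first 37 rows contain Call Connected and outputs a data frame with the columns from, to, and CallConnected, where CallConnected is 1 for yes and 0 for no.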
So, looking at df, the resulting dataset would have 2 rows:
Answer 0 (score: 2)
Here is a solution using the tidyverse package; alternatively, loading just the dplyr and tidyr packages is enough.
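The idea is to fill the NA values in the from and to columns with the nearest preceding non-NA value. After that, use action_result == "CallConnected" to flag the rows that match "CallConnected", group by from and to, and use summarize with sum to count the matching records in each group.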
library(tidyverse)
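df2 <- df %>%
  # carry each call's numbers down onto the NA rows that follow it
  fill(from) %>%
  fill(to) %>%
  # TRUE where the row matches; the label must be spelled as it is in action_result
  mutate(CallConnected = action_result == "CallConnected") %>%
  group_by(from, to) %>%
  # summing a logical vector counts the TRUE rows in each group
  summarize(CallConnected = sum(CallConnected)) %>%
  ungroup()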
df2
# # A tibble: 2 x 3
# from to CallConnected
# <chr> <chr> <int>
# 1 (192) 242-2345 (900) 301-3451 0
# 2 (832) 345-3168 (900) 234-1231 0
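The question asks for a 1/0 flag rather than a count, so a small variation may be closer to what is wanted. The sketch below is only an illustration on made-up data: the tibble `calls`, its phone numbers, and the result name `flags` are hypothetical, and it assumes the label is spelled "Call Connected" as in the question. Adjust the string to whatever actually appears in action_result.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data shaped like df: numbers appear only on the first row of each call
calls <- tibble(
  from = c("(111) 111-1111", NA, NA, "(222) 222-2222", NA),
  to = c("(900) 000-0001", NA, NA, "(900) 000-0002", NA),
  action_result = c("Voicemail", "Call Connected", "Hang Up", "Missed", "Hang Up")
)

flags <- calls %>%
  # fill both columns downward in one call
  fill(from, to) %>%
  group_by(from, to) %>%
  # any() is TRUE if at least one row in the group matches; as.integer() gives 1/0
  summarize(
    CallConnected = as.integer(any(action_result == "Call Connected")),
    .groups = "drop"
  )

flags
```

Here the first call has one matching row, so its flag is 1, and the second has none, so its flag is 0. Using any() instead of sum() keeps the result at 1 even when several rows in a group match.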
Update
If duplicated from and to pairs need to be handled, we can use rleid from the data.table package to create an ID after the fill calls. Here is an example.
library(tidyverse)
library(data.table)
# Create an example with duplication
df_dup <- bind_rows(df, df %>% slice(1:5))
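df_dup2 <- df_dup %>%
  fill(from) %>%
  fill(to) %>%
  # rleid gives a new ID each time the (from, to) pair changes from the previous row
  mutate(ID = rleid(from, to)) %>%
  mutate(CallConnected = action_result == "CallConnected") %>%
  # grouping by ID keeps repeated pairs apart when they are separated by another call
  group_by(ID, from, to) %>%
  summarize(CallConnected = sum(CallConnected)) %>%
  ungroup()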
df_dup2
# # A tibble: 3 x 4
# ID from to CallConnected
# <int> <chr> <chr> <int>
# 1 1 (192) 242-2345 (900) 301-3451 0
# 2 2 (832) 345-3168 (900) 234-1231 0
# 3 3 (192) 242-2345 (900) 301-3451 0
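Since the question asked for a dplyr solution, one possible alternative is to build the ID without data.table. The sketch below assumes that every call starts on a row where from is not NA, as in the posted data; the tibble `calls`, its values, and the name `result` are hypothetical, and the "Call Connected" spelling is taken from the question.

```r
library(dplyr)

# Hypothetical toy data: the third call repeats the (from, to) pair of the first
calls <- tibble(
  from = c("(111) 111-1111", NA, "(222) 222-2222", NA, "(111) 111-1111", NA),
  to = c("(900) 000-0001", NA, "(900) 000-0002", NA, "(900) 000-0001", NA),
  action_result = c("No Answer", "Call Connected", "Busy", "Hang Up", "Missed", "Hang Up")
)

result <- calls %>%
  # the running count of non-NA from values increases by one at the start of each call
  mutate(ID = cumsum(!is.na(from))) %>%
  group_by(ID) %>%
  summarize(
    from = first(from),  # the first row of each group holds the numbers
    to = first(to),
    CallConnected = as.integer(any(action_result == "Call Connected")),
    .groups = "drop"
  )

result
```

Because the ID is derived from where the numbers appear rather than from their values, this version should also keep two back-to-back calls between the same pair apart, which a run-length ID computed after filling would merge into one group.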