Check whether a value exists within specific groups of rows in a data frame

Time: 2019-04-05 00:42:13

Tags: r dataframe dplyr

I have this data frame (df):

structure(list(from = c("(192) 242-2345", NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "(832) 345-3168", 
NA, NA), to = c("(900) 301-3451", NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "(900) 234-1231", 
NA, NA), action_result = c("Voicemail", "No Answer", "No Answer", 
"No Answer", "No Answer", "No Answer", "No Answer", "No Answer", 
"No Answer", "IP Phone Offline", "No Answer", "No Answer", "Busy", 
"Busy", "No Answer", "No Answer", "No Answer", "No Answer", "No Answer", 
"No Answer", "No Answer", "Busy", "IP Phone Offline", "Busy", 
"No Answer", "No Answer", "No Answer", "No Answer", "No Answer", 
"IP Phone Offline", "IP Phone Offline", "No Answer", "No Answer", 
"IP Phone Offline", "No Answer", "No Answer", "Busy", "Missed", 
"Hang Up", "Hang Up")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -40L))

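The first row contains the phone number that placed the call and the number that received it. The rows after the first row are all NA in those columns. So rows 1-37 should be treated as one group, and rows 38 to 40 as a second group. I want to check whether each group contains the value Call Connected in the action_result column.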

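I tried to group_by the values of from, but the full data set I am working with has duplicate from/to pairs, so that does not work. I would like a dplyr solution that checks whether the first 37 rows contain Call Connected and outputs a data frame with the columns:

from, to, CallConnected, where CallConnected is 1 for yes and 0 for no.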

So, looking at df, the resulting data set would have 2 rows, one per group.

1 answer:

Answer 0: (score: 2)

Here is a solution using the tidyverse package; alternatively, loading just the dplyr and tidyr packages is enough.

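The idea is to fill the NA values in the from and to columns with the nearest preceding non-NA value. After that, use action_result == "CallConnected" to flag the records that match "CallConnected", group by from and to, and use summarize with sum to count the matching records in each group. Note that the comparison string has to match the value stored in action_result exactly; the question spells it Call Connected, with a space.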

library(tidyverse)

df2 <- df %>%
  fill(from) %>%
  fill(to) %>%
  mutate(CallConnected = action_result == "CallConnected") %>%
  group_by(from, to) %>%
  summarize(CallConnected = sum(CallConnected)) %>%
  ungroup()
df2
# # A tibble: 2 x 3
#   from           to             CallConnected
#   <chr>          <chr>                  <int>
# 1 (192) 242-2345 (900) 301-3451             0
# 2 (832) 345-3168 (900) 234-1231             0
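
The question asks for a 1/0 flag rather than a count of matching rows. A minimal sketch of that variant follows, assuming the value is spelled "Call Connected" as in the question. The toy data frame calls and the result name flags are made up for illustration and are not part of the original data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two calls, the second of which eventually connects
calls <- data.frame(
  from = c("(111) 111-1111", NA, "(222) 222-2222", NA, NA),
  to   = c("(900) 000-0001", NA, "(900) 000-0002", NA, NA),
  action_result = c("No Answer", "Busy", "Missed", "Call Connected", "Hang Up"),
  stringsAsFactors = FALSE
)

flags <- calls %>%
  fill(from, to) %>%        # carry each number down over the NA rows below it
  group_by(from, to) %>%
  # 1 if any row in the group connected, 0 otherwise
  summarize(CallConnected = as.integer(any(action_result == "Call Connected"))) %>%
  ungroup()
flags
```

Using any() instead of sum() keeps the result at 1 even when several rows in a group match.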

Update

If duplicate from/to pairs need to be handled, we can use rleid from the data.table package to create an ID after the fill step. Below is an example.

library(tidyverse)
library(data.table)

# Create an example with duplication
df_dup <- bind_rows(df, df %>% slice(1:5))

df_dup2 <- df_dup %>%
  fill(from) %>%
  fill(to) %>%
  mutate(ID = rleid(from, to)) %>%
  mutate(CallConnected = action_result == "CallConnected") %>%
  group_by(ID, from, to) %>%
  summarize(CallConnected = sum(CallConnected)) %>%
  ungroup()
df_dup2
# # A tibble: 3 x 4
#      ID from           to             CallConnected
#   <int> <chr>          <chr>                  <int>
# 1     1 (192) 242-2345 (900) 301-3451             0
# 2     2 (832) 345-3168 (900) 234-1231             0
# 3     3 (192) 242-2345 (900) 301-3451             0
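
If adding data.table is not desirable, a similar ID can probably be built with dplyr alone. The sketch below assumes every call starts with a row whose from value is not NA, and counts those rows with cumsum before filling. The names calls_dup and flags_dup and the toy data are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the same from/to pair starts two separate calls
calls_dup <- data.frame(
  from = c("(111) 111-1111", NA, "(222) 222-2222", NA, "(111) 111-1111", NA),
  to   = c("(900) 000-0001", NA, "(900) 000-0002", NA, "(900) 000-0001", NA),
  action_result = c("No Answer", "Busy", "Missed", "Hang Up",
                    "No Answer", "Call Connected"),
  stringsAsFactors = FALSE
)

flags_dup <- calls_dup %>%
  # Must run before fill(), while the NAs still mark continuation rows:
  # the counter goes up by one at every row that starts a new call
  mutate(ID = cumsum(!is.na(from))) %>%
  fill(from, to) %>%
  group_by(ID, from, to) %>%
  summarize(CallConnected = as.integer(any(action_result == "Call Connected"))) %>%
  ungroup()
flags_dup
```

One difference from rleid is worth noting: because the ID is computed before the fill, two back-to-back calls between the same pair of numbers get separate IDs, whereas rleid on the filled columns would merge them into one run.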