How can I count multiple variables in a dataset inside a pipe by incorporating grep functions?

Time: 2019-06-12 08:39:30

Tags: r dplyr pipe grepl

I need to count multiple variables in a dataset in one go, using a pipe.

I used the following code:

#R
NonComp_Strat <- Minor_Behaviours %>%
  filter(Categories == "Non compliant with routine") %>%
  group_by(Strategies) %>%
  summarise(frequency = n())

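However, some cells in my data frame contain multiple entries separated by commas.

For example, it treats the behaviour entries "Disruptive" and "Disruptive, Off Task" as two different values.

Both entries contain the variable I am looking for, but I don't know how to wrap grep or grepl into the pipe so that every individual variable gets counted. There are more than 20 of them, and running 20+ separate grep calls sounds awful. Any help is greatly appreciated.

Thanks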

3 Answers:

Answer 0: (score: 1)

You first have to split the comma-separated values and create a new row for each of them. Then you can group_by as before.

library(splitstackshape)
df <- data.frame(id = c(1:4), Strategies = c("Disruptive", "Disruptive, Off Task", "Off Task", "Off Task, Interview"))
df
  id           Strategies
1  1           Disruptive
2  2 Disruptive, Off Task
3  3             Off Task
4  4  Off Task, Interview
df <- cSplit(df, "Strategies", ",", "long")
df
   id Strategies
1:  1 Disruptive
2:  2 Disruptive
3:  2   Off Task
4:  3   Off Task
5:  4   Off Task
6:  4  Interview
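
To finish the example, the counting step from the question can be applied to the split data. This is only a minimal sketch: it assumes dplyr is installed alongside splitstackshape and reuses the toy data above; the names counts and frequency are illustrative, not part of the original answer.

```r
library(splitstackshape)
library(dplyr)

# Toy data from the example above
df <- data.frame(id = 1:4,
                 Strategies = c("Disruptive", "Disruptive, Off Task",
                                "Off Task", "Off Task, Interview"))

# One row per individual entry, then count how often each entry occurs
counts <- cSplit(df, "Strategies", ",", "long") %>%
  as.data.frame() %>%                                        # drop the data.table class
  mutate(Strategies = trimws(as.character(Strategies))) %>%  # defensive: remove stray spaces
  count(Strategies, name = "frequency")

counts
```

With the toy data, "Off Task" should come out at 3, "Disruptive" at 2 and "Interview" at 1. On the real data, the filter(Categories == ...) step from the question would go before the split.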

Answer 1: (score: 0)

In a dplyr / tidyr workflow:

df %>%
    separate(Strategies, paste("Strategies", 1:5, sep = "_"), extra = "drop", sep = ",") %>%
    gather(Stacked, Strategies, Strategies_1:Strategies_5) %>%
    select(-Stacked) %>%
    na.omit() %>%
    mutate(Strategies = as.factor(trimws(Strategies))) %>%
    group_by(Strategies) %>%
    summarise(count = n()) 



  Strategies     count
  <fct>          <int>
1 Brief Time Out     1
2 Detention          2
3 Disruptive         2
4 Interview          1
5 Off Task           1
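
Note that the separate() call above keeps at most five entries per cell, because extra = "drop" discards anything beyond Strategies_5. A possible alternative, sketched here with the toy data from Answer 0 (an assumption, since this answer does not show its input), is tidyr's separate_rows(), which needs no fixed number of columns:

```r
library(dplyr)
library(tidyr)

# Toy data borrowed from Answer 0 (not the data behind the output above)
df <- data.frame(id = 1:4,
                 Strategies = c("Disruptive", "Disruptive, Off Task",
                                "Off Task", "Off Task, Interview"),
                 stringsAsFactors = FALSE)

counts <- df %>%
  separate_rows(Strategies, sep = ",\\s*") %>%  # one row per comma-separated entry
  count(Strategies, name = "count")             # same as group_by() + summarise(n())

counts
```

The sep pattern also swallows the whitespace after each comma, so no separate trimws() step should be needed for entries written as "a, b".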

Answer 2: (score: 0)

More generally, we can write a split function that produces data in a form reshape can use.

spltCol <- function(x) {
  l <- strsplit(as.character(x), ", ?")
  l <- lapply(l, function(y) c(y, rep(NA, max(lengths(l)) - length(y))))
  return(as.data.frame(do.call(rbind, l)))
}

Example

df1
#   id                  x          z
# 1  1 alpha, beta, gamma  0.7281856
# 2  2        alpha, beta -0.3149730
# 3  3              alpha -2.6412875
# 4  4               <NA>  0.6412990

df12 <- data.frame(append(df1[-2], spltCol(df1$x)))
#   id          z    V1   V2    V3
# 1  1  0.7281856 alpha beta gamma
# 2  2 -0.3149730 alpha beta  <NA>
# 3  3 -2.6412875 alpha <NA>  <NA>
# 4  4  0.6412990  <NA> <NA>  <NA>

reshape(df12, direction="long", varying=cbind("V1", "V2", "V3"), v.names=names(df1)[2])
#     id          z time     x
# 1.1  1  0.7281856    1 alpha
# 2.1  2 -0.3149730    1 alpha
# 3.1  3 -2.6412875    1 alpha
# 4.1  4  0.6412990    1  <NA>
# 1.2  1  0.7281856    2  beta
# 2.2  2 -0.3149730    2  beta
# 3.2  3 -2.6412875    2  <NA>
# 4.2  4  0.6412990    2  <NA>
# 1.3  1  0.7281856    3 gamma
# 2.3  2 -0.3149730    3  <NA>
# 3.3  3 -2.6412875    3  <NA>
# 4.3  4  0.6412990    3  <NA>
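
If only the counts are needed, the reshaping can be skipped altogether. The following base R sketch splits the column and tabulates the pieces directly; x mirrors the x column of df1 as a plain character vector, and entries and counts are illustrative names.

```r
# Same values as df1$x, as a character vector
x <- c("alpha, beta, gamma", "alpha, beta", "alpha", NA)

# Split on a comma optionally followed by a space, flatten, and tabulate
entries <- unlist(strsplit(x, ", ?"))
counts <- table(entries)  # table() ignores NA by default

counts
```

For a factor column such as df1$x, wrap it in as.character() first, as spltCol does.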

Data

df1 <- structure(list(id = 1:4, x = structure(c(3L, 2L, 1L, NA), .Label = c("alpha", 
"alpha, beta", "alpha, beta, gamma"), class = "factor"), z = c(0.72818559355044, 
-0.314973049072542, -2.64128753187138, 0.641298995312115)), class = "data.frame", row.names = c(NA, 
-4L))