Using data.table rather than plyr to match a pattern in any element of the data

Time: 2017-07-05 18:01:39

Tags: r

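I have a very large dataset and have not used data.table before. I find the syntax a bit hard to follow. My main question: how do I reproduce apply-style functions in data.table?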

My data looks like this:

dat1 <- structure(list(id = c(1L, 1L, 2L, 3L), diag1 = structure(1:4, .Label = c("I20.1","I21.3", "I48", "I60.8"), class = "factor"), diag2 = structure(c(3L,2L, 1L, 1L), .Label = c("", "I50", "I60.9"), class = "factor"), diag3 = structure(c(1L, 2L, 1L, 1L), .Label = c("", "I38.1"), class = "factor")), .Names = c("id", "diag1", "diag2", "diag3"), row.names = c(NA, -4L), class = "data.frame")

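I want to add a variable that flags every record with a diagnosis code starting with I20, I21 or I60 in any of the columns diag1, diag2 or diag3. Using apply and a regex, I have done the following.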

code.list <- c("I20", "I21", "I60")
dat1$index <- apply(dat1[2:4], 1, function(i)
  any(grep(paste(code.list, collapse = "|"), i)))

This gives me the final dataset I want, shown below:

structure(list(id = c(1L, 1L, 2L, 3L), diag1 = structure(1:4, .Label = c("I20.1","I21.3", "I48", "I60.8"), class = "factor"), diag2 = structure(c(3L,2L, 1L, 1L), .Label = c("", "I50", "I60.9"), class = "factor"),diag3 = structure(c(1L, 2L, 1L, 1L), .Label = c("", "I38.1"), class = "factor"), index = c(TRUE, TRUE, FALSE, TRUE)), .Names = c("id","diag1", "diag2", "diag3", "index"), row.names = c(NA, -4L), class = "data.frame")

However, using plyr takes too long. I would like the equivalent data.table syntax. Can anyone help?

Thanks in advance

A

1 Answer:

Answer 0: (score: 0)

We can do this with data.table:
library(data.table)
setDT(dat1)[, index := Reduce(`|`, lapply(.SD, grepl,
         pattern = paste(code.list, collapse="|"))), .SDcols = 2:4]
dat1
#    id diag1 diag2 diag3 index
#1:  1 I20.1 I60.9        TRUE
#2:  1 I21.3   I50 I38.1  TRUE
#3:  2   I48             FALSE
#4:  3 I60.8              TRUE
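
To see why this works, it may help to split the one-liner into steps. The sketch below is not part of the original answer: it rebuilds the sample data with plain character columns (an assumption made for brevity; the question uses factors), runs grepl once per column instead of once per row, and then ORs the resulting logical vectors together with Reduce. The names pattern, expected, dt and hits are illustrative only.

```r
library(data.table)

# Sample data from the question, rebuilt with character columns (assumption)
dat1 <- data.frame(
  id    = c(1L, 1L, 2L, 3L),
  diag1 = c("I20.1", "I21.3", "I48", "I60.8"),
  diag2 = c("I60.9", "I50", "", ""),
  diag3 = c("", "I38.1", "", ""),
  stringsAsFactors = FALSE
)
code.list <- c("I20", "I21", "I60")
pattern <- paste(code.list, collapse = "|")   # "I20|I21|I60"

# Row-wise reference result, in the style of the question
expected <- apply(dat1[2:4], 1, function(i) any(grepl(pattern, i)))

# Column-wise version: one logical vector per diag column ...
dt <- as.data.table(dat1)
hits <- lapply(dt[, 2:4], grepl, pattern = pattern)

# ... combined element-wise with `|`, so a row is TRUE if any column matched
dt[, index := Reduce(`|`, hits)]

# Both approaches should agree on this data
stopifnot(identical(unname(expected), dt$index))
```

The likely speed-up comes from grepl being called three times on whole columns rather than once for every row, which is where apply tends to spend its time on large data.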