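Subsetting rows based on multiple columns using data.table - fastest way

Time: 2017-12-01 13:28:31

Tags: r

I was wondering whether there is a more elegant, less clunky, and faster way to do this. I have millions of rows of ICD-coded clinical data; a short example is provided below. I am subsetting the dataset to the rows where any of the columns matches a specific set of diagnosis codes. The code below works but takes a long time in R, and I would like to know whether there is a faster way.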

library(data.table)

# dput() output of the sample data; the session-specific .internal.selfref
# pointer is dropped so that the code can be parsed
dat <- structure(list(eid = 1:10, mc1 = structure(c(4L, 3L, 5L, 2L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("345", "410", "413.9", "I20.1", 
"I23.4"), class = "factor"), oc1 = c(350, 323, 12, 35, 413.1, 
345, 345, 345, 345, 345), oc2 = structure(c(5L, 6L, 4L, 1L, 1L, 
2L, 2L, 2L, 3L, 2L), .Label = c("", "345", "I20.3", "J23.6", 
"K50.1", "K51.4"), class = "factor")), .Names = c("eid", "mc1", 
"oc1", "oc2"), class = c("data.table", "data.frame"), row.names = c(NA, 
-10L))
setDT(dat)  # restore data.table's internal self-reference

The code below subsets all rows that contain a code starting with "I20" or "413" (this includes all sub-codes, such as "I20.4" or "413.9").

dat2 <- dat[substr(dat$mc1, 1, 3) == "413" |
            substr(dat$oc1, 1, 3) == "413" |
            substr(dat$oc2, 1, 3) == "413" |
            substr(dat$mc1, 1, 3) == "I20" |
            substr(dat$oc1, 1, 3) == "I20" |
            substr(dat$oc2, 1, 3) == "I20"]
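
The prefix test above relies on substr() coercing its input to character, which is why it works on factor columns such as mc1 and on the numeric column oc1 alike. A minimal base R sketch of that behaviour, using made-up vectors rather than the real data:

```r
codes <- c("413", "I20")

# Factor input: substr() works on the labels, not the integer codes
f <- factor(c("I20.1", "345", "413.9"))
factor_hits <- substr(f, 1, 3) %in% codes

# Numeric input: values are converted with as.character() first
num <- c(413.1, 35, 345)
numeric_hits <- substr(num, 1, 3) %in% codes

factor_hits   # TRUE FALSE TRUE
numeric_hits  # TRUE FALSE FALSE
```

Since the six comparisons reduce to "is the three-character prefix one of these codes", a single %in% test per column is enough.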

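Is there a faster way? For example, could I loop over each column, look for the specific codes "I20" or "413", and subset those rows?

2 Answers:

Answer 0: (score: 1)

We can specify the columns of interest in .SDcols, loop through the Subset of Data.table (.SD), take the first 3 characters with substr, check whether they are %in% a vector of values, and Reduce the results to a single logical vector for subsetting the rows.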
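
The description above could be sketched roughly as follows. This is a reconstruction from the prose, not necessarily the answerer's exact code: the wrapper name akrun is borrowed from the benchmark call in the other answer, the colsID argument and the five-row dat are assumptions made for illustration, and %chin% (data.table's character-only version of %in%) is used for the membership test.

```r
library(data.table)

# Small stand-in for the question's data (assumed values for illustration)
dat <- data.table(
  eid = 1:5,
  mc1 = c("I20.1", "413.9", "I23.4", "410", "345"),
  oc1 = c(350, 323, 12, 35, 413.1),
  oc2 = c("K50.1", "K51.4", "J23.6", "", "")
)

# Hypothetical wrapper; colsID selects the columns to scan
akrun <- function(dt, colsID = 2:4) {
  codes <- c("413", "I20")
  # One logical vector per column, OR-ed together into a single row filter
  keep <- dt[, Reduce(`|`, lapply(.SD, function(x) substr(x, 1, 3) %chin% codes)),
             .SDcols = colsID]
  dt[keep]
}

res <- akrun(dat)
res$eid
```

Every selected column is scanned in full here, so the cost grows with the number of rows times the number of columns.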


Answer 1: (score: 0)

For larger data, it might help not to re-check every row for each column, skipping rows that have already matched:

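minem <- function(dt, colsID = 2:4) {
  cols <- colnames(dt)[colsID]
  x <- c('413', 'I20')
  # Add a logical flag column by reference (this modifies dt itself)
  set(dt, j = "inn", value = FALSE)
  for (i in cols) {
    # Only test rows that have not matched in an earlier column
    dt[inn == FALSE, inn := substr(get(i), 1, 3) %chin% x]
  }
  res <- dt[inn == TRUE][, inn := NULL]
  set(dt, j = "inn", value = NULL)  # drop the helper column from the input again
  res[]
}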
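
The same short-circuit idea can also be written without adding a helper column to the input table. The following is a hedged variant, not the answerer's code: the names todo and hit and the five-row dat are made up for illustration, and the loop tracks row indices instead of a flag column.

```r
library(data.table)

# Small stand-in for the question's data (assumed values for illustration)
dat <- data.table(
  eid = 1:5,
  mc1 = c("I20.1", "413.9", "I23.4", "410", "345"),
  oc1 = c(350, 323, 12, 35, 413.1),
  oc2 = c("K50.1", "K51.4", "J23.6", "", "")
)

codes <- c("413", "I20")
todo <- seq_len(nrow(dat))  # rows that have not matched yet
hit  <- integer(0)          # rows that matched in some column

for (col in c("mc1", "oc1", "oc2")) {
  # Test only the rows that are still unmatched
  m    <- substr(dat[[col]][todo], 1, 3) %chin% codes
  hit  <- c(hit, todo[m])
  todo <- todo[!m]
}

res <- dat[sort(hit)]
res$eid
```

How much this saves depends on the data: the more rows match in the early columns, the fewer values the later columns have to process.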

n <- 1e7
set.seed(13)
# dts is presumably the 10-row sample data.table from the question
dt <- dts[sample(.N, n, replace = TRUE)]
dt <- cbind(dt, dts[sample(.N, n, replace = TRUE), 2:4])
setnames(dt, make.names(colnames(dt), unique = TRUE))
dt
#           eid   mc1   oc1   oc2 mc1.1 oc1.1 oc2.1
#        1:   8   345 345.0   345   345   345   345
#        2:   3 I23.4  12.0 J23.6 413.9   323 K51.4
#        3:   4   410  35.0       413.9   323 K51.4
#        4:   1 I20.1 350.0 K50.1 I23.4    12 J23.6
#        5:  10   345 345.0   345   345   345   345
#      ---                                        
#  9999996:   3 I23.4  12.0 J23.6 I20.1   350 K50.1
#  9999997:   5   345 413.1       I20.1   350 K50.1
#  9999998:   4   410  35.0         345   345   345
#  9999999:   4   410  35.0         410    35      
# 10000000:  10   345 345.0   345   345   345 I20.3

# akrun() is presumably answer 0's approach wrapped in a function
system.time(r1 <- akrun(dt, 2:ncol(dt))) # 22.88 sec
system.time(r2 <- minem(dt, 2:ncol(dt))) # 17.72 sec
all.equal(r1, r2)
# [1] TRUE