Question

I want to use data.table to speed up a given function, but I'm not sure I'm implementing it the right way:

Data

Given two data.tables (dt and dt_lookup)

library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq(1:n), 
                 thisTime=sample(t, n, replace=TRUE), 
                 thisLocation=sample(la,n,replace=TRUE),
                 finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)

set.seed(4321)
dt_lookup <- data.table(lkpId = rep(paste0("l-",seq(1,1000)), length.out=10000),
                        lkpTime=sample(t, 10000, replace=TRUE),
                        lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposely recycled
setkey(dt_lookup, lkpLocation)

I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the minimum non-negative value of thisTime - lkpTime).

Function

## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the minimum non-negative time between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){

  ## filter lookup based on thisLocation and finalLocation,
  ## and only return values where the lkpId has both 'this' and 'final' locations
  tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
  tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
  availServices <- tempThis[tempThis %in% tempFinal]

  tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]

  ## calculate time difference between 'thisTime' and 'lkpTime' (from thisLocation)
  temp2 <- thisTime - tempThisFinal$lkpTime

  ## take the lkpId with the minimum non-negative difference
  ## (as written, temp2 > 0 keeps strictly positive differences only)
  selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
  selectedId
}

Attempts at a solution

I need to get the lkpId for each row of dt. My initial instinct was therefore to use an *apply function, but it took too long (for me) when n/nrow > 1,000,000. So I tried to implement a data.table solution to see if it's faster:

selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]

However, I'm fairly new to data.table, and this method doesn't appear to give any performance gain over an *apply solution:

lkpIds <- apply(dt, 1, function(x){
  thisLocation <- as.character(x[["thisLocation"]])
  finalLocation <- as.character(x[["finalLocation"]])
  thisTime <- as.numeric(x[["thisTime"]])
  myId <- getId(thisTime, thisLocation, finalLocation)
})

Both take about 30 seconds for n = 10,000.

Question

Is there a better way of using data.table to apply the getId function over each row of dt?

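Update, 8 December 2015

Thanks to the pointer from @eddi, I've redesigned my whole algorithm and am now making use of rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.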
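
For readers who haven't met rolling joins, here is a minimal sketch on toy data. The tables events and lookup and their columns are made up for illustration and are not the question's dt and dt_lookup. With roll = TRUE, each row of events is matched to the lookup row at the same loc with the latest time at or before the event's time. Note that an exact tie on time counts as a match here, which differs from the strict > 0 test in getId.

```r
library(data.table)

## hypothetical toy data, for illustration only
events <- data.table(loc  = c("a", "a", "b"),
                     time = c(5, 12, 7))
lookup <- data.table(loc   = c("a", "a", "b", "b"),
                     time  = c(3, 10, 1, 9),
                     lkpId = c("l-1", "l-2", "l-3", "l-4"))

## keep a copy of the lookup time: after the join, the 'time' column
## holds the value from 'events', not the matched lookup time
lookup[, lkpTime := time]

## equi-join on loc, then roll the last lookup observation forward on time
res <- lookup[events, on = .(loc, time), roll = TRUE]
print(res)
```

Each row of res carries the lkpId and lkpTime of the most recent lookup entry for that location, computed in one join rather than one function call per row.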

Answer 1

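Having spent the time since asking this question looking into what data.table has to offer, and researching data.table joins thanks to @eddi's pointer (for example Rolling join on data.table and inner join with inequality), I've come up with a solution.

One of the tricky parts was moving away from the idea of applying a function to each row, and redesigning the solution to use joins.

There will no doubt be better ways of programming this, but here's my attempt.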

## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'

## find all lookup ids where 'thisLocation' matches 'lkpLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)

dt_this <- dt[dt_lookup, {
  idx = thisTime - i.lkpTime > 0
  .(id = id[idx],
    lkpId = i.lkpId,
    thisTime = thisTime[idx],
    lkpTime = i.lkpTime)
},
by=.EACHI]

## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]

## find all matching 'finalLocation' and 'lkpLocation'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]

## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)

dt_join <- dt_this[dt_final, nomatch=0]

## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]

dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]  

## equivalent dplyr code
# library(dplyr)
# dt_join <- dt_join %>%
#   group_by(id) %>%
#   arrange(timeDiff) %>%
#   slice(1) %>%
#   ungroup
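
As an aside, on versions of data.table that support non-equi joins, the "same location and thisTime - lkpTime > 0" step might be expressed directly in the join condition. This is a different technique from the one used above, shown only as a sketch on made-up toy data (events, lookup and their columns are hypothetical). It covers the time condition only and leaves out the finalLocation check.

```r
library(data.table)

## hypothetical toy data, for illustration only
events <- data.table(id = 1:3,
                     loc = c("a", "a", "b"),
                     thisTime = c(5, 12, 7))
lookup <- data.table(lkpId   = c("l-1", "l-2", "l-3", "l-4"),
                     loc     = c("a", "a", "b", "b"),
                     lkpTime = c(3, 10, 1, 9))

## non-equi join: same location, and lookup time strictly before the event time
cand <- lookup[events,
               on = .(loc, lkpTime < thisTime),
               nomatch = 0L,
               .(id = i.id, lkpId = x.lkpId, timeDiff = i.thisTime - x.lkpTime)]

## keep the candidate with the smallest positive time difference per id
best <- cand[order(timeDiff), .SD[1], by = id][order(id)]
print(best)
```

The x. and i. prefixes pick the column from lookup or events respectively, which avoids ambiguity about which table's time ends up in the result.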
