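I have a data.table object containing timestamps (measured in seconds after midnight). My goal is to run a function that returns, for each row, the number of observations that occurred at most k seconds before that observation.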
# require() loads only one package per call, so load each package separately
library(data.table)
library(dplyr)
library(dtplyr)
set.seed(123)
DF <- data.frame(Secs=cumsum(rexp(10000,1)))
setDT(DF)
> DF
Secs
1: 8.434573e-01
2: 1.420068e+00
3: 2.749122e+00
4: 2.780700e+00
5: 2.836911e+00
---
9996: 1.003014e+04
9997: 1.003382e+04
9998: 1.003384e+04
9999: 1.003414e+04
10000: 1.003781e+04
The function I want to apply to each row is
nS<-function(Second,k=5)
max(1,nrow(DF%>%filter(Secs<Second & Secs>=Second-k)))
One way to get what I want is to use apply, but that takes quite a long time.
system.time(val <- apply(DF,1,nS))
user system elapsed
20.56 0.03 20.66
#Not working
DF%>%mutate(nS=nS(Secs,100))%>%head()
# Also not working
library(lazyeval)
f = function(col1, new_col_name) {
mutate_call = lazyeval::interp(~ nS(a), a = as.name(col1))
DF%>%mutate_(.dots=setNames(list(mutate_call),new_col_name))
}
head(f('Secs', 'nS'))
DF%>%mutate(minTime=Secs-k)%>%head()
Is it possible to do this with mutate? Thank you very much for your help!
Answer 0 (score: 2)
Does using rowwise() work for you?
DF %>% rowwise() %>% mutate(ns = nS(Secs), # default k = 5, equal to your apply
ns2 = nS(Secs, 100)) # second test case k = 100
Source: local data frame [10,000 x 3]
Groups: <by row>
# A tibble: 10,000 × 3
Secs ns ns2
<dbl> <dbl> <dbl>
1 0.1757671 1 1
2 1.1956531 1 1
3 1.6594676 2 2
4 2.6988685 3 3
5 2.8845783 4 4
6 3.1012975 5 5
7 4.1258548 6 6
8 4.1584318 7 7
9 4.2346702 8 8
10 6.0375495 8 9
# ... with 9,990 more rows
On my machine it is only slightly faster than apply...
system.time(DF %>% rowwise() %>% mutate(ns = nS(Secs)))
user system elapsed
13.934 1.060 15.280
system.time(apply(DF, 1, nS))
user system elapsed
14.938 1.101 16.438
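The plain mutate(nS = nS(Secs, 100)) attempt in the question appears to fail because nS is not vectorized: when it receives the whole column, Secs < Second compares the column with itself, so no row matches and every row gets 1. A minimal sketch of a vectorized wrapper follows. The name nS_vec is hypothetical (it is not part of the question or this answer), and a plain data.frame with 200 rows is assumed to keep the example small. It still calls nS once per row, so it should not be expected to be meaningfully faster than rowwise().

```r
library(dplyr)

set.seed(123)
# Assumption: a small plain data.frame stands in for the 10,000-row table
DF <- data.frame(Secs = cumsum(rexp(200, 1)))

# Scalar function from the question
nS <- function(Second, k = 5)
  max(1, nrow(DF %>% filter(Secs < Second & Secs >= Second - k)))

# Hypothetical wrapper: applies nS to each element and returns a numeric vector
nS_vec <- function(Seconds, k = 5) vapply(Seconds, nS, numeric(1), k = k)

res <- DF %>% mutate(ns = nS_vec(Secs),        # default k = 5
                     ns2 = nS_vec(Secs, 100))  # k = 100
head(res)
```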
Answer 1 (score: 2)
If you can do without dplyr entirely, this is very fast:
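applyNS <- function(s, k = 5) {
  n <- length(s)
  cnt <- numeric(n)
  # At lag i, compare each element with the one i positions before it
  for (i in seq_len(max(n - 1, 0))) {
    res <- (s[(1 + i):n] - s[1:(n - i)]) <= k
    cnt[(1 + i):n] <- cnt[(1 + i):n] + res
    # s is sorted, so if no pair at lag i is within k, no larger lag can be
    if (!any(res)) break
  }
  cnt
}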
The function assumes that s is sorted in ascending order.
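The result of this function differs slightly from yours: your code reports a count of 1 even when the gap to the previous timestamp is already greater than k (because of max(1, ...)). That is easy to adjust, and then the results are identical: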
DF <- data.frame(Secs=cumsum(rexp(10000,1)))
nS<-function(Second,k=5)
max(1,nrow(DF%>%filter(Secs<Second & Secs>=Second-k)))
result <- apply(DF,1,nS)
result1 <- applyNS(DF$Secs)
result1[result1 == 0] <- 1
print(all(result - result1 == 0))
This prints '[1] TRUE'. Note that this implementation is much faster:
> system.time(apply(DF, 1, nS))
user system elapsed
8.31 0.00 8.43
> system.time(replicate(100,{result1 <- applyNS(DF$Secs); result1[result1 == 0] <- 1}))/100
user system elapsed
0.0071 0.0000 0.0073
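Because the timestamps are sorted, the same counts can probably also be obtained without an explicit loop. The sketch below uses base R's findInterval; the helper name countWithin is hypothetical and not part of either answer, and the approach has not been benchmarked against applyNS here. For each element it takes the number of values strictly below it and subtracts the number of values strictly below the lower bound s - k, which leaves the values in [s - k, s).

```r
# Hypothetical helper: for each element of a sorted vector s, count how many
# values lie in the half-open window [s[i] - k, s[i])
countWithin <- function(s, k = 5) {
  stopifnot(!is.unsorted(s))  # findInterval requires a non-decreasing vector
  # With left.open = TRUE, findInterval returns the number of values < x
  below_self   <- findInterval(s, s, left.open = TRUE)
  below_window <- findInterval(s - k, s, left.open = TRUE)
  below_self - below_window
}

s <- c(1, 2, 4, 10, 11, 11.5)
countWithin(s)            # 0 1 2 0 1 2
pmax(1, countWithin(s))   # 1 1 2 1 1 2, matching the max(1, ...) convention of nS
```

Applied to the data above, this would be pmax(1, countWithin(DF$Secs)).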