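I need sapply to return a list of booleans based on whether a time difference exceeds some threshold (in my case, a number of days set by a for loop).
Sample data (the dates have already been converted with as.Date):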
#DF called "held"
ID Result Start_Date
123 0 12/5/2013
123 0 12/12/2013
123 0 12/31/2013
123 0 4/22/2014
123 1 4/23/2014
654 0 9/3/2013
654 0 9/17/2013
98 0 10/18/2013
98 0 10/19/2013
98 2 12/20/2013
555 0 2/1/2014
555 0 3/2/2014
555 0 3/3/2014
66 1 1/12/2013
Code:
#empty vectors to be populated for plotting
a <- c()
b <- c()
for (n in 1:60) {
  #all rows where ID is not duplicated and Result is either 1 or 2 are FALSE
  #all ID's where the difference between the min and max Start_Date (across multiple rows) exceeds the threshold are TRUE
  held$CHNS <- ((!(!(held$ID %in% held$ID[duplicated(held$ID) | duplicated(held$ID, fromLast = TRUE)]) & (held$Result %in% c(1,2)))) & (sapply(held$ID, function(x) max(held$Start_Date[held$ID == x]) - min(held$Start_Date[held$ID == x]) > n)))
  #find percentage of Results 1 and 2 in entire CHNS population
  m <- length(held$Result[held$Result %in% c(1,2) & held$CHNS == TRUE]) / nrow(held[held$CHNS == TRUE,])
  #assign vector elements
  a[n] <- n
  b[n] <- m
}
The current code appears to be accurate, but it is extremely slow. Any tips on how to improve it? Should I even be using sapply? Thanks!
Answer 0 (score: 2)
This can be vectorized nicely, as follows.
held <- read.table(text=
'ID Result Start_Date
123 0 12/5/2013
123 0 12/12/2013
123 0 12/31/2013
123 0 4/22/2014
123 1 4/23/2014
654 0 9/3/2013
654 0 9/17/2013
98 0 10/18/2013
98 0 10/19/2013
98 2 12/20/2013
555 0 2/1/2014
555 0 3/2/2014
555 0 3/3/2014
66 1 1/12/2013', header=TRUE)
held$Start_Date <- as.Date(held$Start_Date, '%m/%d/%Y')
# Add a column giving the number of days spanned for the ID
held$date.diff <- with(held, {
  ndays <- tapply(Start_Date, ID, function(x) diff(range(x)))
  ndays[match(ID, names(ndays))]
})
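sapply(1:60, function(n) {
  with(held, {
    # rule1: the ID occurs exactly once and Result is 1 or 2.
    # (Plain !duplicated(ID) would flag the first row of every ID instead.)
    rule1 <- !(duplicated(ID) | duplicated(ID, fromLast = TRUE)) & Result %in% 1:2
    # rule2: the ID's date span exceeds the threshold n
    rule2 <- date.diff > n
    outcome <- !rule1 & rule2
    sum(outcome & Result %in% 1:2) / sum(outcome)
  })
})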
# [1] 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462
# [8] 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1818182
# [15] 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182
# [22] 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182
# [29] 0.1818182 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
# [36] 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
# [43] 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
# [50] 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
# [57] 0.2500000 0.2500000 0.2500000 0.2500000
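The threshold loop itself can also be dropped by comparing the per-row spans against all thresholds at once. The following is only a sketch under a stated assumption: because an ID that occurs once has a span of 0 days, it can never exceed a threshold of 1 or more, so the "not duplicated" rule is left out here. The names `span_days`, `exceeds`, `is_hit` and `share` are made up for illustration.

```r
# Self-contained copy of the sample data from the question.
held <- data.frame(
  ID = c(123, 123, 123, 123, 123, 654, 654, 98, 98, 98, 555, 555, 555, 66),
  Result = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1),
  Start_Date = as.Date(
    c("12/5/2013", "12/12/2013", "12/31/2013", "4/22/2014", "4/23/2014",
      "9/3/2013", "9/17/2013", "10/18/2013", "10/19/2013", "12/20/2013",
      "2/1/2014", "3/2/2014", "3/3/2014", "1/12/2013"),
    format = "%m/%d/%Y"
  )
)

# Days spanned per ID, repeated on every row of that ID (computed once).
span_days <- ave(as.numeric(held$Start_Date), held$ID,
                 FUN = function(x) max(x) - min(x))

thresholds <- 1:60
is_hit <- held$Result %in% 1:2

# exceeds[i, j] is TRUE when row i's ID spans more than thresholds[j] days.
exceeds <- outer(span_days, thresholds, ">")

# Share of rows with Result 1 or 2 among the rows passing each threshold.
# is_hit is recycled down the columns, so it lines up with the rows.
share <- colSums(exceeds & is_hit) / colSums(exceeds)
```

If no row passes a given threshold, the corresponding element of `share` is `NaN` (0 / 0), so larger threshold ranges may need a guard.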
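A quick benchmark:
library(microbenchmark)
# jbaums() and userNaN() are not defined in the post; presumably they wrap the
# vectorized code above and the original loop, respectively
microbenchmark(jbaums(), userNaN())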
# Unit: milliseconds
# expr min lq median uq max neval
# jbaums() 1.994695 2.110046 2.164258 2.223137 3.685502 100
# userNaN() 110.448790 112.985603 114.911328 117.714080 489.052823 100
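The wrapper functions used in the benchmark are not shown, so here is a rough sketch of how such a comparison might be set up with base R only. `loop_version()` and `hoisted_version()` are hypothetical names, and both are simplified: they omit the "ID not duplicated" rule, on the assumption that a single-row ID has a span of 0 days and therefore never passes a threshold of 1 or more. The point is to confirm that the two versions agree before timing them.

```r
held <- data.frame(
  ID = c(123, 123, 123, 123, 123, 654, 654, 98, 98, 98, 555, 555, 555, 66),
  Result = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1),
  Start_Date = as.Date(
    c("12/5/2013", "12/12/2013", "12/31/2013", "4/22/2014", "4/23/2014",
      "9/3/2013", "9/17/2013", "10/18/2013", "10/19/2013", "12/20/2013",
      "2/1/2014", "3/2/2014", "3/3/2014", "1/12/2013"),
    format = "%m/%d/%Y"
  )
)

# Hypothetical wrapper: recomputes every ID's span for every row and threshold,
# in the style of the question's loop.
loop_version <- function(df, thresholds = 1:60) {
  out <- numeric(length(thresholds))
  for (i in seq_along(thresholds)) {
    n <- thresholds[i]
    chns <- sapply(df$ID, function(x) {
      d <- df$Start_Date[df$ID == x]
      as.numeric(max(d) - min(d)) > n
    })
    out[i] <- sum(chns & df$Result %in% 1:2) / sum(chns)
  }
  out
}

# Hypothetical wrapper: computes the span once per ID, then reuses it.
hoisted_version <- function(df, thresholds = 1:60) {
  span <- ave(as.numeric(df$Start_Date), df$ID,
              FUN = function(x) max(x) - min(x))
  hit <- df$Result %in% 1:2
  sapply(thresholds, function(n) sum(span > n & hit) / sum(span > n))
}

# Check agreement first, then time repeated runs of each version.
results_match <- isTRUE(all.equal(loop_version(held), hoisted_version(held)))
stopifnot(results_match)
system.time(for (k in 1:20) loop_version(held))
system.time(for (k in 1:20) hoisted_version(held))
```

Timings depend on the machine and on the size of the data, so no expected numbers are given here.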
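Answer 1 (score: 1)
First, I would compute the date difference for each ID outside the loop. Then, if you must, do the threshold check inside a loop over 1:60. I would also use dplyr to compute the differences, which should simplify the code considerably and will probably make it faster. Using your example: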
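library(dplyr)
# Compute each ID's date span (in days) once, outside the loop
by_id <- group_by(held, ID)
Diff <- summarise(by_id, Difference = as.numeric(max(Start_Date) - min(Start_Date)))
a <- 1:60
b <- vector('numeric', 60)
for (n in 1:60) {
  # proportion of IDs whose span exceeds n days
  b[n] <- mean(Diff$Difference > n)
}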
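That should give you a vector b containing, for each threshold n, the proportion of IDs in the population whose date span is greater than n days.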
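For readers without dplyr, the same idea can be sketched in base R. This is an illustration, not the answerer's code: `span_by_id` is a made-up name, and `vapply` stands in for the explicit loop. Note that the result is a proportion of IDs, one value per ID, which is a different quantity from a share computed over rows.

```r
held <- data.frame(
  ID = c(123, 123, 123, 123, 123, 654, 654, 98, 98, 98, 555, 555, 555, 66),
  Result = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1),
  Start_Date = as.Date(
    c("12/5/2013", "12/12/2013", "12/31/2013", "4/22/2014", "4/23/2014",
      "9/3/2013", "9/17/2013", "10/18/2013", "10/19/2013", "12/20/2013",
      "2/1/2014", "3/2/2014", "3/3/2014", "1/12/2013"),
    format = "%m/%d/%Y"
  )
)

# One span (in days) per ID, computed once.
span_by_id <- tapply(as.numeric(held$Start_Date), held$ID,
                     function(x) max(x) - min(x))

a <- 1:60
# For each threshold n, the proportion of IDs whose span exceeds n days.
b <- vapply(a, function(n) mean(span_by_id > n), numeric(1))
```

With the sample data there are five IDs, so `b` can only take values that are multiples of 0.2.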