I have the following data frames:
AllDays
2012-01-01
2012-01-02
2012-01-03
...
2015-08-18
Leases
StartDate EndDate
2012-01-01 2013-01-01
2012-05-07 2013-05-06
2013-09-05 2013-12-01
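What I want to do is, for each date in the AllDays data frame, count the number of leases in effect on that date. For example, if there are 4 leases with StartDate <= 2015-01-01 and EndDate >= 2015-01-01, I would like to put a 4 in that data frame for 2015-01-01.
I have the following code: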
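# For each lease, expand it into daily dates and add 1 to every matching day
for (i in 1:nrow(leases))
{
occupied = seq(leases$StartDate[i],leases$EndDate[i],by="days")
# dateOfInt is the date of interest (today); keep only days before it
occupied = occupied[occupied < dateOfInt]
matching = match(occupied,allDays$Date)
allDays$Occupancy[matching] = allDays$Occupancy[matching] + 1
}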
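It works, but since I have about 5000 leases, it takes about 1.1 seconds. Does anyone have a more efficient method that requires less computation time? The date of interest (dateOfInt) is just the current date, and is only used to make sure that lease days in the future are not counted.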
Answer 0: (score: 5)
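This is exactly the kind of problem where foverlaps shines: subsetting one data.frame based on another data.frame (foverlaps seems to be tailor-made for this purpose).
Based on @MichaelChirico's data: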
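# days and leases are the data.tables built in @MichaelChirico's answer
# foverlaps needs an interval on both sides, so each day becomes [AllDays, AllDays1]
setkey(days[, AllDays1:=AllDays,], AllDays, AllDays1)
setkey(leases, StartDate, EndDate)
foverlaps(leases, days)[, .(lease_count=.N), AllDays]
#    user  system elapsed
#   0.114   0.018   0.136
# @MichaelChirico's approach
#    user  system elapsed
#   0.909   0.000   0.907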
Here is a brief explanation by @Arun of how it works, which is what got me started with data.table.
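To make the overlap join easier to follow, here is a minimal sketch on toy data. The four days and three leases below are made up for illustration and are not from the question; the sketch assumes the data.table package is installed and that every lease has StartDate <= EndDate.

```r
library(data.table)

# Hypothetical toy data (not from the question): four days, three leases
days <- data.table(AllDays = as.Date("2012-01-01") + 0:3)
leases <- data.table(
  StartDate = as.Date(c("2012-01-01", "2012-01-02", "2012-01-03")),
  EndDate   = as.Date(c("2012-01-02", "2012-01-04", "2012-01-03"))
)

# foverlaps() joins intervals to intervals, so each day becomes [d, d]
days[, AllDays1 := AllDays]
setkey(days, AllDays, AllDays1)
setkey(leases, StartDate, EndDate)

# One row per (lease, day) overlap, then count the rows for each day
toy_counts <- foverlaps(leases, days)[, .(lease_count = .N), keyby = AllDays]
print(toy_counts)
```

Note that a day on which no lease is active does not appear in the result at all, so if zero counts are needed they have to be filled in afterwards, for example by joining the counts back onto the full table of days.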
Answer 1: (score: 4)
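Using seq is almost certainly inefficient. Imagine your data contained a lease 10000 years long: seq would take forever and return 10000*365-1 days that are irrelevant to us. We would then have to use %in% (or match, as in the question), which makes the same number of unnecessary comparisons.
I'm not sure the following is the best approach (I'm sure there is a fully vectorized solution), but it gets closer to the heart of the problem.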
set.seed(102349)
days<-data.frame(AllDays=seq(as.Date("2012-01-01"),
as.Date("2015-08-18"),"day"))
leases<-data.frame(StartDate=sample(days$AllDays,5000L,T))
leases$EndDate<-leases$StartDate+round(rnorm(5000,mean=365,sd=100))
Using data.table and sapply:
library(data.table)
setDT(leases); setDT(days)
days[,lease_count:=
sapply(AllDays,function(x)
leases[StartDate<=x&EndDate>=x,.N])][]
AllDays lease_count
1: 2012-01-01 5
2: 2012-01-02 8
3: 2012-01-03 11
4: 2012-01-04 16
5: 2012-01-05 18
---
1322: 2015-08-14 1358
1323: 2015-08-15 1358
1324: 2015-08-16 1360
1325: 2015-08-17 1363
1326: 2015-08-18 1359
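The answer above suspects that a fully vectorized solution exists. One possible sketch in base R, not the answerer's own method, rests on a counting identity: if every lease has StartDate <= EndDate, the number of leases active on a day equals the number of leases that have started on or before that day minus the number that ended strictly before it. The helper name count_active and the toy data are hypothetical.

```r
# count_active() is a hypothetical helper, not part of any package.
# Assumes start[i] <= end[i] for every lease.
count_active <- function(day, start, end) {
  day_n  <- as.numeric(day)
  starts <- sort(as.numeric(start))
  ends   <- sort(as.numeric(end))
  # number of leases with start <= day
  started <- findInterval(day_n, starts)
  # number of leases with end < day (left.open = TRUE makes the comparison strict)
  ended <- findInterval(day_n, ends, left.open = TRUE)
  started - ended
}

# Toy data for illustration only
start <- as.Date(c("2013-01-01", "2012-08-20", "2014-06-01"))
end   <- as.Date(c("2013-12-31", "2014-12-31", "2015-05-31"))
day   <- as.Date(c("2012-01-01", "2013-06-01", "2014-06-01", "2015-01-01", "2015-06-01"))

active <- count_active(day, start, end)

# Cross-check against the direct definition used in the sapply version
brute <- sapply(seq_along(day), function(i) sum(start <= day[i] & end >= day[i]))
print(data.frame(day, active, brute))
```

Because the work is two sorts and two binary searches per day, no lease is ever expanded into its individual days, so a very long lease costs no more than a short one.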
Answer 2: (score: 2)
Without your data I can't test whether this is faster, but it gets the job done with less code:
for (i in 1:nrow(AllDays)) AllDays$tally[i] = sum(AllDays$AllDays[i] >= Leases$Start.Date & AllDays$AllDays[i] <= Leases$End.Date)
I tested with the following; note that the relevant columns in both data frames are formatted as dates:
AllDays = data.frame(AllDays = seq(from=as.Date("2012-01-01"), to=as.Date("2015-08-18"), by=1))
Leases = data.frame(Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")), End.Date = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31")))
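As a possible variation on the loop above, the same comparison can be written with vapply, which builds the whole column in one step and checks that each count is a single integer. This is only a sketch using the same test data; whether it is faster has not been measured here.

```r
# Same test data as above
AllDays <- data.frame(AllDays = seq(from = as.Date("2012-01-01"),
                                    to = as.Date("2015-08-18"), by = 1))
Leases <- data.frame(
  Start.Date = as.Date(c("2013-01-01", "2012-08-20", "2014-06-01")),
  End.Date   = as.Date(c("2013-12-31", "2014-12-31", "2015-05-31"))
)

# For each day, count the leases whose [Start.Date, End.Date] range contains it
AllDays$tally <- vapply(
  seq_len(nrow(AllDays)),
  function(i) {
    d <- AllDays$AllDays[i]
    sum(d >= Leases$Start.Date & d <= Leases$End.Date)
  },
  integer(1)
)

head(AllDays)
```

seq_len(nrow(AllDays)) is used instead of 1:nrow(AllDays) so that an empty data frame produces zero iterations rather than iterating over c(1, 0).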
Answer 3: (score: 1)
Another approach, though I'm not sure it is any faster.
library(lubridate)
library(dplyr)
AllDays = data.frame(dates = c("2012-02-01","2012-03-02","2012-04-03"))
Lease = data.frame(start = c("2012-01-03","2012-03-01","2012-04-02"),
end = c("2012-02-05","2012-04-15","2012-07-11"))
# transform to dates
AllDays$dates = ymd(AllDays$dates)
Lease$start = ymd(Lease$start)
Lease$end = ymd(Lease$end)
# create the range id
Lease$id = 1:nrow(Lease)
AllDays
# dates
# 1 2012-02-01
# 2 2012-03-02
# 3 2012-04-03
Lease
# start end id
# 1 2012-01-03 2012-02-05 1
# 2 2012-03-01 2012-04-15 2
# 3 2012-04-02 2012-07-11 3
data.frame(expand.grid(AllDays$dates,Lease$id)) %>% # create combinations of dates and ranges
select(dates=Var1, id=Var2) %>%
inner_join(Lease, by="id") %>% # join information
rowwise %>%
do(data.frame(dates=.$dates,
flag = ifelse(.$dates %in% seq(.$start,.$end,by="1 day"),1,0))) %>% # create ranges and check if the date is in there
ungroup %>%
group_by(dates) %>%
summarise(N=sum(flag))
# dates N
# 1 2012-02-01 1
# 2 2012-03-02 1
# 3 2012-04-03 2
Answer 4: (score: 0)
Try the lubridate package. Create an interval for each lease, then count how many lease intervals each date falls within.
# make some data
AllDays <- data.frame("Days" = seq.Date(as.Date("2012-01-01"), as.Date("2012-02-01"), by = 1))
Leases <- data.frame("StartDate" = as.Date(c("2012-01-01", "2012-01-08")),
"EndDate" = as.Date(c("2012-01-10", "2012-01-21")))
library(lubridate)
# new_interval() is deprecated in newer lubridate versions; interval() is its replacement
x <- interval(Leases$StartDate, Leases$EndDate, tzone = "UTC")
AllDays$NumberInEffect <- sapply(AllDays$Days, function(a){sum(a %within% x)})
Output:
head(AllDays)
Days NumberInEffect
1 2012-01-01 1
2 2012-01-02 1
3 2012-01-03 1
4 2012-01-04 1
5 2012-01-05 1
6 2012-01-06 1