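I have a large dataset representing blocks of time that alternate between two types, and I would like clean breaks at the year boundaries so that every row starts and ends in the same year.
As an example, see the table below.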
type duration cumsum year year.split
1 1 236 236 1 365
2 0 129 365 1 365
3 1 154 519 2 730
4 0 216 735 3 1095
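Nothing crosses the boundary between year 1 and year 2, because row 3 starts on the first day of year 2. Row 4, however, starts in year 2 and ends on day 5 of year 3. I want to split row 4 so that the table looks like this.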
type duration cumsum year year.split
1 1 236 236 1 365
2 0 129 365 1 365
3 1 0 519 1 365
4 1 154 519 2 730
5 0 211 524 2 730
6 0 5 735 3 1095
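As you can see, no block spans more than one year any more, because each overlapping block has been split so that every row starts and ends in the same year. The way I have done this so far is shown below, but it seems clunky and I am hoping there is a more elegant solution.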
library(dplyr)

set.seed(808)
test <- data.frame(type = c(1, 0), duration = round(runif(20, min = 100, max = 250))) %>%
  mutate(cumsum = cumsum(duration), year = ceiling(cumsum / 365), year.split = year * 365)

test <- rbind(test[1, ],
              # rows ending in the same year as the previous row: nothing to split
              filter(test, lag(year) == year),
              # rows ending in a later year: keep the part that falls in the later year
              filter(test, lag(year) != year) %>%
                mutate(duration = cumsum - (year - 1) * 365),
              # ...and the part that falls in the earlier year (its duration may be 0)
              filter(test, lag(year) != year) %>%
                mutate(duration = ((year - 1) * 365 + duration - cumsum),
                       cumsum = cumsum - duration,
                       year = year - 1,
                       year.split = year * 365)) %>%
  arrange(year, cumsum)

test <- group_by(test, type, year) %>%
  summarise(duration = sum(duration)) %>%
  ungroup() %>%
  arrange(year)
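The last two lines of code summarise the data, because I am interested in the total duration of each type per year.
What would be a better way to do this?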
Answer 0 (score: 2)
This seems to work, assuming the durations are all strictly positive:
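cs <- test$cumsum
# Merge the original end points with every year boundary (365, 730, ...)
cs0 <- sort(unique(c(cs, seq_len(floor(max(cs) / 365)) * 365)))
# Find the original row each new block falls in (relies on integer cumsum values)
data.frame(type = test$type[findInterval(cs0 - 0.5, cs) + 1],
           duration = diff(c(0, cs0)), cumsum = cs0, year = ceiling(cs0 / 365))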
type duration cumsum year
1 1 236 236 1
2 0 129 365 1
3 1 154 519 2
4 0 211 730 2
5 0 5 735 3
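For what it's worth, the same idea can be wrapped in a small function and followed by the per-type, per-year totals the question asks for. This is only a sketch in base R, assuming strictly positive integer durations; `split_at_years` and `year_len` are names made up for this illustration and do not come from any package.

```r
# Hypothetical helper: split cumulative time blocks at every year boundary.
# Assumes strictly positive integer durations.
split_at_years <- function(type, duration, year_len = 365) {
  cs <- cumsum(duration)
  # Year boundaries that fall at or before the last end point
  bounds <- seq_len(floor(max(cs) / year_len)) * year_len
  cs0 <- sort(unique(c(cs, bounds)))
  # Each new block belongs to the original row whose interval contains it
  idx <- findInterval(cs0 - 0.5, cs) + 1
  data.frame(type = type[idx],
             duration = diff(c(0, cs0)),
             cumsum = cs0,
             year = ceiling(cs0 / year_len))
}

# The four example rows from the question
res <- split_at_years(type = c(1, 0, 1, 0), duration = c(236, 129, 154, 216))

# Total duration per type and year
totals <- aggregate(duration ~ type + year, data = res, FUN = sum)
```

Using `aggregate` keeps the sketch free of extra packages; a `group_by`/`summarise` pipeline like the one in the question would do the same job.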
Answer 1 (score: 0)
Not sure whether this is the R way you are looking for, but you can simplify the rbind call:
rbind(filter(test, cumsum - duration >= (year - 1) * 365),
      filter(test, cumsum - duration < (year - 1) * 365) %>%
        mutate(duration = cumsum - (year - 1) * 365),
      filter(test, cumsum - duration < (year - 1) * 365) %>%
        mutate(year = year - 1, # I'm changing the year first so it will propagate
               duration = duration - (cumsum - (year * 365)),
               cumsum = (year) * 365,
               year.split = year * 365)
)
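As you can see, I am combining three data.frames: (1) the rows that do not cross a year boundary, (2) for the rows that do, the part that falls in the later year, and (3) for those same rows, the part that falls in the earlier year.
There are two things I don't like here: I use the same filter twice (for cases 2 and 3), and tomorrow it will take me 10 to 15 minutes to understand this code (or I could add a comment like # It works, don't worry).
I think a more verbose version of this code is easier to maintain: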
# These don't overlap
ok <- filter(test, cumsum - duration >= (year - 1) * 365)
# These do overlap! We need to split them in two
ko <- filter(test, cumsum - duration < (year - 1) * 365)
# For the most recent year, it's enough to change the duration
ko.recent <- mutate(ko,
                    duration = cumsum - (year - 1) * 365
)
# For the previous year, a bit more
ko.previous <- mutate(ko,
                      year = year - 1, # I'm changing the year first
                                       # so it will propagate
                      duration = duration - (cumsum - (year * 365)),
                      cumsum = (year) * 365,
                      year.split = year * 365
)
# Let me put them back together and sort them for you
test1 <- rbind(ok,
               ko.recent,
               ko.previous
) %>% arrange(year, cumsum)
Not sure whether this is the answer you were looking for; I am still learning R.
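One way to gain some confidence in any of these versions is to check two invariants on the result: the total duration is unchanged, and no row crosses a year boundary. The following is a rough sketch rather than a definitive test; `check_split` and `year_len` are hypothetical names, and it assumes data frames with the `duration`, `cumsum` and `year` columns used above.

```r
# Hypothetical checker: `original` and `pieces` are data frames with
# columns duration, cumsum and year, as in the question.
check_split <- function(original, pieces, year_len = 365) {
  start <- pieces$cumsum - pieces$duration   # where each block begins
  stopifnot(
    # Splitting must not create or lose time
    sum(pieces$duration) == sum(original$duration),
    # Every block starts and ends inside its own year
    start >= (pieces$year - 1) * year_len,
    pieces$cumsum <= pieces$year * year_len
  )
  invisible(TRUE)
}

# The example rows from the question, before and after splitting
original <- data.frame(duration = c(236, 129, 154, 216),
                       cumsum   = c(236, 365, 519, 735),
                       year     = c(1, 1, 2, 3))
pieces <- data.frame(duration = c(236, 129, 154, 211, 5),
                     cumsum   = c(236, 365, 519, 730, 735),
                     year     = c(1, 1, 2, 2, 3))
check_split(original, pieces)
```

Calling `check_split(original, original)` stops with an error, because the unsplit fourth row starts in year 2 but is labelled year 3.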