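I am working on an analysis in which I calculate a rate that is partly based on a rolling number of days. I perform the calculation with dplyr, using group_by / summarise / mutate operations.
However, the increment of the rolling sum varies from group to group. Ideally I would have a measurement every 30 days, but sometimes the measurement interval is 60 or 90 days.
For example: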
df <- data.frame( ID = "Subject A",
cumulative_days = c(30, 60, 90, 180, 270, 360),
rolling_percent = c(.8, .6, .6, .4, .3, .2))
I would like to reshape this group into:
result <- data.frame(ID = "Subject A",
month = seq(1,12),
rolling_percent = c(.8, .6, .6, NA, NA, .4, NA, NA, .3, NA, NA, .2))
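If I can get to the 'result' data frame above, my plan is to use the dplyr / zoo solution described here: fill in NA based on the last non-NA value for each group in R
That way I can fill in the NAs with the last non-NA observation.
In other words, I want to expand N observations whose cumulative days add up to 360 into 12 monthly observations. At that point I believe I can apply the solution from the other link to solve my problem.
I am having trouble describing the situation clearly, so any suggestions for clarifying my question would be appreciated.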
Answer 0 (score: 2)
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
dt[, .(ID, month = cumulative_days/30, rolling_percent)][
CJ(ID = unique(ID), month = 1:12), on = c('ID', 'month')]
# ID month rolling_percent
# 1: Subject A 1 0.8
# 2: Subject A 2 0.6
# 3: Subject A 3 0.6
# 4: Subject A 4 NA
# 5: Subject A 5 NA
# 6: Subject A 6 0.4
# 7: Subject A 7 NA
# 8: Subject A 8 NA
# 9: Subject A 9 0.3
#10: Subject A 10 NA
#11: Subject A 11 NA
#12: Subject A 12 0.2
# or simply make it a rolling join to achieve your desired final result
dt[, .(ID, month = cumulative_days/30, rolling_percent)][
CJ(ID = unique(ID), month = 1:12), on = c('ID', 'month'), roll = TRUE]
# ID month rolling_percent
# 1: Subject A 1 0.8
# 2: Subject A 2 0.6
# 3: Subject A 3 0.6
# 4: Subject A 4 0.6
# 5: Subject A 5 0.6
# 6: Subject A 6 0.4
# 7: Subject A 7 0.4
# 8: Subject A 8 0.4
# 9: Subject A 9 0.3
#10: Subject A 10 0.3
#11: Subject A 11 0.3
#12: Subject A 12 0.2
Instead of selecting columns as above, you can simply add a new month column:
dt[, month := cumulative_days/30][
CJ(ID = unique(ID), month = 1:12), on = c('ID', 'month'), roll = TRUE]
# ID cumulative_days rolling_percent month
# 1: Subject A 30 0.8 1
# 2: Subject A 60 0.6 2
# 3: Subject A 90 0.6 3
# 4: Subject A 90 0.6 4
# 5: Subject A 90 0.6 5
# 6: Subject A 180 0.4 6
# 7: Subject A 180 0.4 7
# 8: Subject A 180 0.4 8
# 9: Subject A 270 0.3 9
#10: Subject A 270 0.3 10
#11: Subject A 270 0.3 11
#12: Subject A 360 0.2 12
Answer 1 (score: 1)
Here is a solution that joins the data.frame to a complete data.frame of months:
library(dplyr)
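df$month <- df$cumulative_days / 30
# Build the full month sequence, then join the observed values onto it
result <- data.frame(ID = "Subject A", month = seq(1, max(df$month))) %>%
  left_join(df, by = c("ID", "month")) %>%
  select(-cumulative_days)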
If you want to apply the solution to several IDs, for example this made-up dataset:
df <- data.frame( ID = "Subject A",
cumulative_days = c(30, 60, 90, 180, 270, 360),
rolling_percent = c(.8, .6, .6, .4, .3, .2))
df2 <- data.frame( ID = "Subject B",
cumulative_days = c(30, 90, 120, 180, 270, 360),
rolling_percent = c(.6, .4, .3, .2, .1, .6))
df<-rbind(df,df2)
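you can wrap the previous code in a function, split the large data frame by ID, apply the function to each piece separately, and finally bind all the results together. The code would look like this: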
buildDf<-function(df){
df$month<-df$cumulative_days/30
data.frame(ID = df$ID[1], month = seq(1, max(df$month))) %>%
  left_join(df, by = c("ID", "month")) %>% select(-cumulative_days)
}
listDf<-split(df,f=df$ID)
listDfFiltered<-lapply(listDf,buildDf)
result<-do.call('rbind',listDfFiltered)
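As an aside, the split / apply / bind steps above can probably be collapsed into one grouped pipeline. The sketch below is not part of the original answer: it assumes the tidyr package is available, that every cumulative_days value is a multiple of 30, and that each ID should cover months 1 to 12. It also carries the last observed value forward, which is the final result the question asks for.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr is installed; the answer above only uses dplyr

df <- data.frame(
  ID = rep(c("Subject A", "Subject B"), each = 6),
  cumulative_days = c(30, 60, 90, 180, 270, 360,
                      30, 90, 120, 180, 270, 360),
  rolling_percent = c(.8, .6, .6, .4, .3, .2,
                      .6, .4, .3, .2, .1, .6)
)

result <- df %>%
  mutate(month = cumulative_days / 30) %>%        # assumes multiples of 30 days
  select(-cumulative_days) %>%
  group_by(ID) %>%
  complete(month = as.numeric(1:12)) %>%          # add the missing months as NA rows
  fill(rolling_percent, .direction = "down") %>%  # carry the last non-NA value forward
  ungroup()
```

Drop the fill() step if you only want the expanded data frame with NAs.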
Hope this helps.
Answer 2 (score: 1)
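We can do this with base R. Create a 'month' column by dividing by 30. Then use expand.grid to get a data.frame with all combinations of 'ID' and the range of 'month', and merge it with the original dataset so that 'rolling_percent' is NA for the 'ID', 'month' combinations that are not found in 'df'.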
df$month <-df$cumulative_days/30
merge(expand.grid(ID = unique(df$ID),
month=Reduce(`:`, range(df$month))), df[-2], all.x=TRUE)
# ID month rolling_percent
#1 Subject A 1 0.8
#2 Subject A 2 0.6
#3 Subject A 3 0.6
#4 Subject A 4 NA
#5 Subject A 5 NA
#6 Subject A 6 0.4
#7 Subject A 7 NA
#8 Subject A 8 NA
#9 Subject A 9 0.3
#10 Subject A 10 NA
#11 Subject A 11 NA
#12 Subject A 12 0.2
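To go from this NA-padded table to the filled result the question is ultimately after, the last non-NA value can be carried forward within each ID without extra packages. This is only a sketch: locf is a hypothetical helper name, not a base R function, and it assumes the rows are already sorted by ID and month, which merge does by default.

```r
df <- data.frame(ID = "Subject A",
                 cumulative_days = c(30, 60, 90, 180, 270, 360),
                 rolling_percent = c(.8, .6, .6, .4, .3, .2))
df$month <- df$cumulative_days / 30

# Same expansion as above, written with an explicit 1:12 month range
full <- merge(expand.grid(ID = unique(df$ID), month = 1:12),
              df[-2], all.x = TRUE)

# Hypothetical helper: last observation carried forward
locf <- function(v) {
  # index of the most recent non-NA value seen so far (0 if none yet)
  idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
  idx[idx == 0] <- NA  # leading NAs have nothing to carry forward
  v[idx]
}

# Apply the fill separately within each ID
full$rolling_percent <- ave(full$rolling_percent, full$ID, FUN = locf)
```

zoo::na.locf, mentioned in the question, does the same job if you would rather depend on zoo.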