我有多个湖泊的数据集,这些湖泊的水位会随时间而升高。观测值没有规则的间隔并且有很多大的空白。此外,一些较旧的观察结果可能质量较低或未知。我创建了一个单独的模型,该模型可以很好地预测整个时间段的水位,但仍会因变化的量而错过实际的观测结果。
我想创建第三组输入/内插数据集,其中的解决方案是:
由缺少观察值的模型值通知 穿越高度加权的观察 并通过较低的加权观测值得到通知
到目前为止,我已经使用寓言包的TSLM-> interpolate来执行此操作。它工作得相当好,但是我看不到在过程中引入权重的方法。此外,它在很大程度上依赖于全局系数,并且在建模值明显未达到所观察到的数值时,会对其进行截取,使其变得有些不稳定。我认为我需要使用某种依赖于局部系数且可以适应加权的加权黄土。
library(dplyr)
library(tsibble)
library(fable)
library(ggplot2)
test_data <- data.frame(obs_year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009),
site_name = c("Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2"),
observed = c(100,200,NA, NA, NA, NA, 220, NA, NA, 125, NA,NA,425, NA, 475, NA, 450, 450, 475, 500),
weights = c(1,1,NA, NA, NA, NA, 2, NA, NA, 2, NA,NA,2, NA, 1, NA, 2, 2, 2, 2),
modeled = c(110,120,165,150, 200, 225, 240, 250, 150, 130, 450,430,415,400, 425, 450, 460, 460, 470, 490))
test_tsibble <- as_tsibble(test_data, key = site_name, index = obs_year)
tslm_interpolate <- test_tsibble %>%
group_by(site_name) %>%
model(lm = TSLM(observed~modeled)) %>%
fabletools::interpolate(test_tsibble)
tslm_interpolate <- left_join(tslm_interpolate, test_data, by = c("site_name", "obs_year")) %>%
dplyr::select(obs_year, site_name, observed = observed.y, imputed = observed.x, modeled, weights)
tslm_interpolate %>%
ggplot(aes(x=obs_year))+
geom_line(aes(y = imputed), color = "blue")+
geom_line(aes(y = modeled), color = "red")+
geom_point(aes(y = observed), color = "green")+
facet_wrap(~site_name, scales = "free_y")