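Interpolate and extrapolate missing columns in a data frame

Time: 2013-08-29 20:53:46

Tags: r interpolation extrapolation

I have a data frame that contains data for every 4th or 5th year. I would like to interpolate values for the years that are missing from the data frame, and to extrapolate values for columns beyond both ends of the data frame.

I have been able to perform the interpolation with the code below. The only problem is that the middle column is repeated, and one copy must be removed. Is there a more efficient way to interpolate? I am also not sure how to approach the extrapolation. The actual data set contains 12 years (columns) of available data.

Thank you for any suggestions.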

my.data <- read.table(text = '
    y1980  y1985  y1990
     0.10   0.20   0.40
     1.00   2.00   4.00
    10.00  20.00  40.00
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)

desired.result <- read.table(text = '
    y1978 y1979 y1980 y1981 y1982 y1983 y1984 y1985 y1986 y1987 y1988 y1989 y1990 y1991 y1992
     0.06  0.08  0.10  0.12  0.14  0.16  0.18  0.20  0.24  0.28  0.32  0.36  0.40  0.44  0.48
     0.60  0.80  1.0   1.2   1.4   1.6   1.8   2.0   2.4   2.8   3.2   3.6   4.0   4.4   4.8
     6     8    10    12    14    16    18    20    24    28    32    36    40    44    48
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
desired.result

# reshape data to form two columns
new.data  <- reshape(my.data, direction="long", 
                     varying = list(seq(1,(ncol(my.data)-1),1), seq(2,(ncol(my.data)-0),1)), 
                     v.names=c("v1", "v2"))

# interpolate every row
interpol  <- t(apply(new.data[,2:3], 1, function(x) approx(x, n = 6)$y))
new.data2 <- data.frame(time = new.data$time, interpol, id = new.data$id)

# reform row:column structure 
my.data2  <- reshape(new.data2, idvar="id", timevar = "time", direction = "wide")

# middle columns are repeated and must be removed
my.data3  <- my.data2[, !names(my.data2) %in% c("X1.2")]
my.data3

    id X1.1  X2.1  X3.1  X4.1  X5.1 X6.1  X2.2  X3.2  X4.2  X5.2 X6.2
1.1  1  0.1  0.12  0.14  0.16  0.18  0.2  0.24  0.28  0.32  0.36  0.4
2.1  2  1.0  1.20  1.40  1.60  1.80  2.0  2.40  2.80  3.20  3.60  4.0
3.1  3 10.0 12.00 14.00 16.00 18.00 20.0 24.00 28.00 32.00 36.00 40.0

A possible alternative approach to the interpolation, which does not work:

sapply( seq(1, (ncol(my.data)-1), 1), function(i) {approx(c(my.data[,i], my.data[,i+1]), n = 6)$y } )
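The `sapply` attempt above probably fails because `approx()`, when given a single vector, treats it as y-values at x = 1, 2, ..., so the six values from two stacked columns come back unchanged instead of being interpolated row by row. A minimal sketch of one way around this, assuming the column names encode the year as they do in `my.data`, is to pass the years as `x` and interpolate each row once over the full range. The names `all.years` and `interp` are illustrative only.

```r
my.data <- data.frame(y1980 = c(0.1, 1, 10),
                      y1985 = c(0.2, 2, 20),
                      y1990 = c(0.4, 4, 40))

# Known years, parsed from column names such as "y1980"
years <- as.numeric(sub("^y", "", names(my.data)))

# Every year between the first and last known year
all.years <- seq(min(years), max(years))

# Interpolate each row once over the whole range, so no column is duplicated
interp <- t(apply(my.data, 1, function(row) {
  approx(x = years, y = row, xout = all.years)$y
}))
colnames(interp) <- paste0("y", all.years)
interp
```

Because each row is interpolated in a single call, there is no shared endpoint column to remove afterwards. Note that `approx()` does not extrapolate linearly, so this covers only the years between the first and last known columns.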

2 answers:

Answer 0: (score: 1)

Here is an alternative formulation.

First, a useful helper function:

tvseq <- function(...)t(Vectorize(seq.default)(...))
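To see what this helper does, here is a small sketch with made-up inputs (the vectors below are illustrative and not taken from the question). `Vectorize(seq.default)` builds one sequence per element of `from` and `to`, and `t()` turns those sequences into rows.

```r
tvseq <- function(...) t(Vectorize(seq.default)(...))

# One sequence per row: 0.1 -> 0.2 and 1 -> 2, each with 6 points
m <- tvseq(from = c(0.1, 1), to = c(0.2, 2), length.out = 6)
m
```

Each row of `m` should be an evenly spaced sequence between the matching `from` and `to` values, which is the shape needed for filling in the years between two known columns.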

Now perform the interpolation:

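# numeric years parsed from the column names, e.g. "y1980" -> 1980
years <- as.numeric(gsub("y","",names(my.data)))

# gap, in years, between consecutive known columns
d <- diff(years)

# for each gap, build row-wise sequences from column i to column i+1,
# dropping the first column so shared endpoints are not duplicated
L <- lapply(seq(d), function(i) tvseq(from=my.data[,i], to=my.data[,i+1], length.out=d[i]+1)[,-1])

# prepend the first known column and label the columns by year
result <- cbind(my.data[,1], do.call(cbind, L))
colnames(result) <- paste0("y",min(years):max(years))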

Result:

> result
     y1980 y1981 y1982 y1983 y1984 y1985 y1986 y1987 y1988 y1989 y1990
[1,]   0.1  0.12  0.14  0.16  0.18   0.2  0.24  0.28  0.32  0.36   0.4
[2,]   1.0  1.20  1.40  1.60  1.80   2.0  2.40  2.80  3.20  3.60   4.0
[3,]  10.0 12.00 14.00 16.00 18.00  20.0 24.00 28.00 32.00 36.00  40.0

To add the extrapolation, use:

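# years to extrapolate before the first known year
ylow <- 1978:(min(years)-1)
# step backwards from the first column using the first yearly increment
low <- tvseq(to=result[,1], by=result[,2]-result[,1], length.out=length(ylow)+1)[,1:length(ylow)]
colnames(low) <- paste0("y",ylow)

# years to extrapolate after the last known year
yhigh <- (max(years)+1):1992
# step forwards from the last column using the last yearly increment
high <- tvseq(from=result[,ncol(result)], by=result[,ncol(result)]-result[,ncol(result)-1], length.out=length(yhigh)+1)[,-1]
colnames(high) <- paste0("y",yhigh)

cbind(low, result, high)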

Result:

     y1978 y1979 y1980 y1981 y1982 y1983 y1984 y1985 y1986 y1987 y1988 y1989 y1990 y1991 y1992
[1,]  0.06  0.08   0.1  0.12  0.14  0.16  0.18   0.2  0.24  0.28  0.32  0.36   0.4  0.44  0.48
[2,]  0.60  0.80   1.0  1.20  1.40  1.60  1.80   2.0  2.40  2.80  3.20  3.60   4.0  4.40  4.80
[3,]  6.00  8.00  10.0 12.00 14.00 16.00 18.00  20.0 24.00 28.00 32.00 36.00  40.0 44.00 48.00
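For comparison, the same interpolation and straight-line extrapolation can be sketched in base R with `approx()` plus a manual extension of the first and last segments. This is only a sketch under stated assumptions: `interp_extrap` is a hypothetical helper name, and it assumes `x` is sorted in increasing order with at least two known points.

```r
my.data <- data.frame(y1980 = c(0.1, 1, 10),
                      y1985 = c(0.2, 2, 20),
                      y1990 = c(0.4, 4, 40))

# Hypothetical helper: linear interpolation inside range(x), and
# straight-line extension of the first/last segment outside it
interp_extrap <- function(x, y, xout) {
  n   <- length(x)
  out <- approx(x, y, xout = xout)$y          # NA outside range(x)
  lo  <- xout < x[1]
  hi  <- xout > x[n]
  out[lo] <- y[1] + (xout[lo] - x[1]) * (y[2] - y[1]) / (x[2] - x[1])
  out[hi] <- y[n] + (xout[hi] - x[n]) * (y[n] - y[n - 1]) / (x[n] - x[n - 1])
  out
}

years <- as.numeric(sub("^y", "", names(my.data)))
want  <- 1978:1992                            # target years from the question

full <- t(apply(my.data, 1, function(row) interp_extrap(years, row, want)))
colnames(full) <- paste0("y", want)
full
```

The target range `1978:1992` mirrors the one used in the question. Any other range can be passed as `xout`, with values outside the known years taken from the slope of the nearest segment.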

Answer 1: (score: 1)

An alternative for both interpolation and extrapolation:

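library(zoo)
# transpose so that years are rows and the original rows become columns
df <- data.frame(t(my.data))
# numeric year taken from row names such as "y1980"
df$yr <- as.numeric(substring(rownames(df), first = 2))
# regular yearly series; as.ts() pads the missing years with NA
z1 <- zoo(df, order.by = df$yr, frequency = 1)
t1 <- as.ts(x = z1)
# fill the NAs by linear interpolation
t2 <- na.approx(t1)
# extend two years past each end using the slope of the last / first two points
future <- apply(t2, 2, function(x) tail(x, 1) + diff(tail(x, 2)) * 1:2)
past <- apply(t2, 2, function(x) head(x, 1) - diff(head(x, 2)) * 1:2)
# combine, sort by year, and transpose back to one column per year
t3 <- rbind(past, t2, future)
t3 <- t3[order(t3[ , "yr"]), ]
t4 <- t(t3)[1:3, ]
colnames(t4) <- paste0("y", t3[ , "yr"])
t4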