Question

I have a dataset with ID, year, and income columns. I am trying to interpolate the annual values to quarterly values.

DATA

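For example, I would like interpolated income values for 2000Q1, 2000Q2, 2000Q3, 2000Q4, 2001Q1, ..., 2001Q4. The resulting data frame would have the columns id, year-quarter, and income, where income holds the interpolated value.

I realize that with linear interpolation the trend must be based only on the rows of the corresponding ID. Any suggestions on how to do this interpolation in R?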

Answer 1

Here is an example using dplyr:

library(dplyr)

# Example annual data; person 2 has no observation for 2011
annual_data <- data.frame(
    person=c(1, 1, 1, 2, 2),
    year=c(2010, 2011, 2012, 2010, 2012),
    y=c(1, 2, 3, 1, 3)
    )

# Build one row per quarter for a single person's range of years
expand_data <- function(x) {
    years <- min(x$year):max(x$year)
    quarters <- 1:4
    grid <- expand.grid(quarter=quarters, year=years)
    # Treat each annual value as the first-quarter observation
    x$quarter <- 1
    # Quarters without an observation get NA for y
    merged <- grid %>% left_join(x, by=c('year', 'quarter'))
    merged$person <- x$person[1]
    return(merged)
}

# Linearly interpolate the NA values of y by row position
interpolate_data <- function(data) {
    xout <- 1:nrow(data)
    y <- data$y
    interpolation <- approx(x=xout[!is.na(y)], y=y[!is.na(y)], xout=xout)
    data$yhat <- interpolation$y
    return(data)
}

expand_and_interpolate <- function(x) interpolate_data(expand_data(x))

# Apply the expansion and interpolation separately to each person
quarterly_data <- annual_data %>% group_by(person) %>% do(expand_and_interpolate(.))

print(as.data.frame(quarterly_data))

The output of this approach is:

   quarter year person  y yhat
1        1 2010      1  1 1.00
2        2 2010      1 NA 1.25
3        3 2010      1 NA 1.50
4        4 2010      1 NA 1.75
5        1 2011      1  2 2.00
6        2 2011      1 NA 2.25
7        3 2011      1 NA 2.50
8        4 2011      1 NA 2.75
9        1 2012      1  3 3.00
10       2 2012      1 NA   NA
11       3 2012      1 NA   NA
12       4 2012      1 NA   NA
13       1 2010      2  1 1.00
14       2 2010      2 NA 1.25
15       3 2010      2 NA 1.50
16       4 2010      2 NA 1.75
17       1 2011      2 NA 2.00
18       2 2011      2 NA 2.25
19       3 2011      2 NA 2.50
20       4 2011      2 NA 2.75
21       1 2012      2  3 3.00
22       2 2012      2 NA   NA
23       3 2012      2 NA   NA
24       4 2012      2 NA   NA

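There are probably many ways to clean this up. The key functions used are expand.grid, approx, and dplyr::group_by. The approx function is a little tricky. Looking at the implementation of zoo::na.approx.default was very helpful for figuring out how to use approx.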
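As a minimal sketch of how approx behaves here (the vector below is made-up illustration data, not the asker's): the observed points are passed by row position, and xout lists every position to fill. With the default rule = 1, approx does not extrapolate, which is why the last three quarters of 2012 are NA in the output above. Setting rule = 2 is one option if carrying the last observed value forward is acceptable.

```r
# Made-up series: one observed value every fourth position, NA elsewhere
y <- c(1, NA, NA, NA, 2, NA, NA, NA, 3, NA, NA, NA)
pos <- seq_along(y)
observed <- !is.na(y)

# Default rule = 1: no extrapolation, so positions after the last
# observed point stay NA
fit_default <- approx(x = pos[observed], y = y[observed], xout = pos)

# rule = 2: positions outside the observed range take the nearest
# observed value instead of NA
fit_carry <- approx(x = pos[observed], y = y[observed], xout = pos, rule = 2)

print(fit_default$y)
print(fit_carry$y)
```

Whether rule = 2 is appropriate depends on the data; for income it assumes the value stays flat after the last observation.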

Answer 2

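I like to use this convention for splitting a data frame into subsets (in your case, one per unique value of 'id'), applying a function to each subset, and then putting the data frame back together.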

df2 <- do.call("rbind", lapply(split(df, df$id), function(df_subset) {

  # the operations inside these braces are applied to one subset data frame at a time,
  #   equivalent to 'subset(df, id == x)' where x is each unique value of id

  return(df_subset) # returns df_subset unchanged; alter it here in any way you need

}))

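There are several ways to do linear interpolation, but I personally default to na.approx() from the 'zoo' package. You need to add rows to the data frame representing each quarter, with NA as their income value. na.approx will then fill them in with interpolated values, as in df_subset$income_interpolated <- na.approx(df_subset$income, na.rm = FALSE). The na.rm = FALSE argument matters when the column starts or ends with NA: by default na.approx drops those leading and trailing NAs, so the result would be shorter than the column.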
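Putting the two pieces together, a rough sketch might look like the following. The data values are hypothetical, the column names id, year, and income follow the question, and placing each annual value in the first quarter is an assumption rather than something the question specifies.

```r
library(zoo)  # provides na.approx()

# Hypothetical annual data; id 2 has no observation for 2001
df <- data.frame(
  id     = c(1, 1, 1, 2, 2),
  year   = c(2000, 2001, 2002, 2000, 2002),
  income = c(100, 200, 300, 50, 150)
)

df2 <- do.call("rbind", lapply(split(df, df$id), function(df_subset) {
  # One row per quarter between the first and last observed year
  grid <- expand.grid(quarter = 1:4,
                      year = min(df_subset$year):max(df_subset$year))
  grid$id <- df_subset$id[1]

  # Assumption: each annual value is the Q1 observation; other quarters are NA
  df_subset$quarter <- 1
  merged <- merge(grid, df_subset, by = c("id", "year", "quarter"), all.x = TRUE)
  merged <- merged[order(merged$year, merged$quarter), ]

  # na.rm = FALSE keeps trailing NAs so the result has one value per row
  merged$income_interpolated <- na.approx(merged$income, na.rm = FALSE)
  merged$year_quarter <- paste0(merged$year, "Q", merged$quarter)
  merged[, c("id", "year_quarter", "income_interpolated")]
}))
rownames(df2) <- NULL

print(df2)
```

Because the interpolation runs inside each subset, the trend for one id never uses values from another id. Quarters after the last observed Q1 value remain NA, since na.approx only interpolates between observed points.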

Interpolating annual time series data to quarterly values in R
