Question

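I have an incomplete (time) series, and I want to fill in its missing values using the latest available value together with the growth rates from another series, separately for each category (country). The categories and the runs of missing values are not of equal length. This means applying a function to a variable sequentially: first I need to take the last available data point (which can be anywhere) and divide it by 1 + the growth rate, then move on to the next data point and do the same.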
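To make the recurrence concrete, here is a minimal standalone C++ sketch of it for a single series. It is only an illustration of the rule described above, not code from the post: the function name `backfill_series` is hypothetical, missing entries are assumed to be encoded as NaN, and `growth[i]` is assumed to be the growth from period i-1 to period i.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not from the original post): back-fill one series in place.
// Assumption: missing entries are NaN, and growth[i] is the growth from
// period i-1 to period i, so values[i-1] = values[i] / (1 + growth[i]).
void backfill_series(std::vector<double>& values, const std::vector<double>& growth) {
    if (values.size() != growth.size()) {
        throw std::invalid_argument("values and growth must have the same length");
    }
    // Walk from the last element toward the first, so that a value filled in
    // one step is available as the base for the step before it.
    for (std::size_t i = values.size(); i-- > 1; ) {
        bool next_known = !std::isnan(values[i]) && !std::isnan(growth[i]);
        if (std::isnan(values[i - 1]) && next_known) {
            values[i - 1] = values[i] / (1.0 + growth[i]);
        }
    }
}
```

With category A from the example below (values NA, NA, NA, 1.155 and growth NA, 0.05, 0.10, 0) this yields 1.155 / 1 = 1.155, then 1.155 / 1.10 = 1.05, then 1.05 / 1.05 = 1.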

Example dataset and desired result:

library(data.table)

# Desired result: within each category, values compound from 1 at the given growth rates
DT_desired <- data.table(category = rep(c("A", "B"), each = 4),
                         year = rep(2010:2013, times = 2),
                         grwth = c(NA, 0.05, 0.1, 0, NA, 0.1, 0.15, 0.2))
DT_desired[, values := cumprod(c(1, 1 + grwth[!is.na(grwth)])), by = category]

# Example input: blank out some values so that they have to be reconstructed
DT_example <- copy(DT_desired)[c(1, 2, 3, 5), values := NA]

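What I have tried: you can do this with a for loop, but that is inefficient in R and discouraged. I have come to like the efficiency of data.table, so I would prefer to do it that way. I tried data.table's shift function, which fills only one missing value (which is logical, I guess, because it evaluates everything at once, while the remaining rows are still missing the value they depend on).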

DT_example[, values := ifelse(is.na(values),
                              shift(values, type = "lead") / (1 + shift(grwth, type = "lead")),
                              values),
           by = category]
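The effect can be reproduced outside R. The following standalone C++ sketch mimics a single vectorized pass of the expression above; the function name `one_pass_lead_fill` is hypothetical, NaN stands in for NA, and both vectors are assumed to have the same length. The key point is that every lead is read from the original column, never from a cell filled earlier in the same pass.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper mimicking ONE vectorized pass of
//   ifelse(is.na(values), lead(values) / (1 + lead(grwth)), values)
// Assumption: NaN encodes NA and both vectors have equal length.
std::vector<double> one_pass_lead_fill(const std::vector<double>& values,
                                       const std::vector<double>& growth) {
    std::vector<double> out = values;
    for (std::size_t i = 0; i + 1 < values.size(); ++i) {
        if (std::isnan(values[i])) {
            // Reads the ORIGINAL next value; if that is still missing,
            // the NaN simply propagates and the cell stays unfilled.
            out[i] = values[i + 1] / (1.0 + growth[i + 1]);
        }
    }
    return out;
}
```

For category A, one pass turns NA, NA, NA, 1.155 into NA, NA, 1.155, 1.155. Each further pass fills one more cell, so a gap of length k needs k passes, which is why a sequential backward walk is the more natural fit.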

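I gather from other posts that you can probably do this with the rollapply function from the zoo package, but I feel I should be able to do it in data.table without yet another package, and that the solution is relatively simple and elegant; I am just not experienced enough to find it.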

Apologies if I have overlooked the relevant post and this turns out to be a duplicate, but nothing I found does quite what I want.

Answer 1

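Not sure whether this has already been solved outside of SO, but it caught my attention the other day. I have not written any Rcpp in a long time, so I figured this would be good practice. I know you are looking for a native data.table solution, so feel free to take it or leave it: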

Contents of the file foo.cpp:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector fillValues(NumericVector vals, NumericVector gRates){

  int n = vals.size();
  NumericVector out(n);

  double currentValue   = vals[n - 1];
  double currentGrowth  = gRates[n - 1];

  // initial assignment
  out[n - 1] = currentValue;

  for(int i = n - 2; i >= 0; i--){

    if(NumericVector::is_na(vals[i])){
      // If vals[i] is NA, we need a known later value and growth rate to fill it
      if(NumericVector::is_na(currentValue) || NumericVector::is_na(currentGrowth)){
        // Without both a currentValue and a currentGrowth there is nothing to base the fill on
        Rcpp::stop("Missing value or growth rate where an actual value was needed");
      } else {
        // Update value
        out[i] = currentValue / (1 + currentGrowth);
      }
    } else {
      out[i] = vals[i];
    }

    // update
    currentValue = out[i];
    if(!NumericVector::is_na(gRates[i])){
      currentGrowth = gRates[i];
    }
  }

  return out;
}

/*** R
library(data.table)

# Desired result: within each category, values compound from 1 at the given growth rates
DT_desired <- data.table(category = rep(c("A", "B"), each = 4),
                         year = rep(2010:2013, times = 2),
                         grwth = c(NA, 0.05, 0.1, 0, NA, 0.1, 0.15, 0.2))
DT_desired[, values := cumprod(c(1, 1 + grwth[!is.na(grwth)])), by = category]

# Example input: blank out some values so that they have to be reconstructed
DT_example <- copy(DT_desired)[c(1, 2, 3, 5), values := NA]

DT_desired[]
DT_example[]

DT_example[, values := fillValues(values, grwth), by = category][]
*/

Then run it:

> Rcpp::sourceCpp('foo.cpp')

# Removed output that created example data

> DT_desired[]
   category year grwth values
1:        A 2010    NA  1.000
2:        A 2011  0.05  1.050
3:        A 2012  0.10  1.155
4:        A 2013  0.00  1.155
5:        B 2010    NA  1.000
6:        B 2011  0.10  1.100
7:        B 2012  0.15  1.265
8:        B 2013  0.20  1.518

> DT_example[]
   category year grwth values
1:        A 2010    NA     NA
2:        A 2011  0.05     NA
3:        A 2012  0.10     NA
4:        A 2013  0.00  1.155
5:        B 2010    NA     NA
6:        B 2011  0.10  1.100
7:        B 2012  0.15  1.265
8:        B 2013  0.20  1.518

> DT_example[, values := fillValues(values, grwth), by = category][]
   category year grwth values
1:        A 2010    NA  1.000
2:        A 2011  0.05  1.050
3:        A 2012  0.10  1.155
4:        A 2013  0.00  1.155
5:        B 2010    NA  1.000
6:        B 2011  0.10  1.100
7:        B 2012  0.15  1.265
8:        B 2013  0.20  1.518

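Note that this works from back to front, so it assumes you want to start from the most recent record and proceed toward older ones. It also assumes that your data set is sorted.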
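Both assumptions can be made explicit in code. The sketch below is a standalone C++ illustration, not part of foo.cpp: the `Row` struct and the function `sort_and_backfill` are hypothetical stand-ins for a table row and for the sort plus per-category fill, and NaN is assumed to encode NA. It sorts by (category, year) first, then walks backward and never carries a value across a category boundary.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical row type standing in for one row of the table.
struct Row {
    std::string category;
    int year;
    double growth;  // NaN when missing
    double value;   // NaN when missing
};

// Sort by (category, year), then back-fill missing values within each category.
void sort_and_backfill(std::vector<Row>& rows) {
    std::sort(rows.begin(), rows.end(), [](const Row& a, const Row& b) {
        return std::tie(a.category, a.year) < std::tie(b.category, b.year);
    });
    // Walk from the most recent row toward the oldest one.
    for (std::size_t i = rows.size(); i-- > 1; ) {
        const Row& next = rows[i];
        Row& cur = rows[i - 1];
        bool same_group = cur.category == next.category;
        bool next_known = !std::isnan(next.value) && !std::isnan(next.growth);
        if (same_group && std::isnan(cur.value) && next_known) {
            cur.value = next.value / (1.0 + next.growth);
        }
    }
}
```

Unlike the Rcpp function, this sketch leaves a cell missing instead of raising an error when no later value is available in the same category; which behavior is preferable depends on the data.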

Fill in missing values in a data.table using category growth rates
