Question

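I am trying to clean a large dataset. I have a matrix of prices ordered by date, with the most recent date in the first row and a different stock in each column. If the price of a particular stock is missing or NA on a given day, I use the previous day's price. If the price on the last (oldest) day is NA, I leave it as it is.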

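I first looped over the whole matrix and used an IF statement for each (i, j) pair. That was very slow. For a price matrix b, my next approach is as follows: I use which() to find the indices of the NAs and only process those indices.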

for (j in 1:ncol(b))
{
  # Rows of column j that are missing (NA) or zero
  Index <- which(is.na(b[, j]) | b[, j] == 0)
  if (length(Index) == 0)
  {
    print("0 Missings")
    next  # nothing to fill in this column
  }
  for (k in length(Index):1)  # backwards, to fill from the bottom
  {
    i <- Index[k]
    # If the oldest date is missing, then set it to N/A so that N/A is passed forward as opposed to 0.
    if (i == nrow(b))
    {
      b[[i, j]] <- '#N/A'  # note: assigning a string coerces a numeric matrix to character
    }
    else
    {
      b[[i, j]] <- b[[i + 1, j]]  # take the price from the date before
    }
  }
}

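This is somewhat faster, but not by much. For a 400x6000 matrix it still takes over an hour. I was hoping for a fully vectorized approach, so I tried something like this: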

b[Index, j] <- b[Index + 1, j]

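However, I don't think R uses the sequentially updated values. By this I mean that it does not update step by step from the bottom, so the newly filled values are never reused. This matters when I have 2 consecutive NA entries, because the vectorized approach above will only fill one of them. Some kind of efficient, sequential vectorized code would update the first one and then use it to update the second. Any ideas?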
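To make the problem concrete, here is a minimal sketch on a toy column (the vector `col` and the names `one_shot` and `sequential` are made up for illustration). It assumes the newest price comes first and that missing prices are filled from the row below.

```r
# Toy column: newest price first, two consecutive NAs in rows 2 and 3
col <- c(10, NA, NA, 7, 6)
idx <- which(is.na(col))

# One-shot vectorized fill: the right-hand side is read before anything is written,
# so row 2 receives the old NA from row 3
one_shot <- col
one_shot[idx] <- col[idx + 1]
print(one_shot)

# Sequential fill from the bottom: freshly filled values are reused
sequential <- col
for (i in rev(idx)) sequential[i] <- sequential[i + 1]
print(sequential)
```

The one-shot version leaves row 2 as NA, while the sequential loop fills both gaps with 7.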

Many thanks for your efforts.

Answer 1

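Here is a possibility using the MESS package. It is essentially no different from @Roland's comment above, so I am only including it here so that you can see the formatting. The filldown function is written in C++, so it is quite fast: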

x <- matrix(c(1, 2, 3, 4, NA, 6, NA, NA, NA, NA, 11, 12, 13, 14, 15, NA, 17, 18, NA, 20), nrow=5)
 x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   NA
[2,]    2   NA   12   17
[3,]    3   NA   13   18
[4,]    4   NA   14   NA
[5,]   NA   NA   15   20

Then using

library(MESS)
apply(x, 2, filldown)

yields

     [,1] [,2] [,3] [,4]
[1,]    1    6   11   NA
[2,]    2    6   12   17
[3,]    3    6   13   18
[4,]    4    6   14   18
[5,]    4    6   15   20
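Note that in the output above each NA takes the value from the row above it, whereas the question's matrix is newest-first and wants the value from the row below. One possible way to get that direction is to reverse each column before and after filling. The sketch below uses a hypothetical base-R stand-in, `fill_down_sketch`, so that it runs without MESS; it is not the MESS implementation and is only assumed to behave like the output shown above.

```r
# Hypothetical base-R stand-in for a fill-down function (not the MESS implementation)
fill_down_sketch <- function(v) {
  for (i in seq_along(v)[-1]) {
    if (is.na(v[i])) v[i] <- v[i - 1]  # carry the value from the row above
  }
  v
}

x <- matrix(c(1, 2, 3, 4, NA, 6, NA, NA, NA, NA, 11, 12, 13, 14, 15,
              NA, 17, 18, NA, 20), nrow = 5)

# Top-down fill: each NA takes the value from the row above
top_down <- apply(x, 2, fill_down_sketch)

# Bottom-up fill: reverse each column, fill, then reverse back,
# so each NA takes the value from the row below (the older date)
bottom_up <- apply(x, 2, function(v) rev(fill_down_sketch(rev(v))))
print(bottom_up)
```

With the bottom-up variant, an NA in the last row (the oldest date) stays NA, which matches what the question asks for.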

Answer 2

If you want a pure R version, here is one:

First define a suitably large test set:

set.seed(42)
nRow <- 1000
nCol <- 500
test <- matrix(rnorm(nRow * nCol), 
               nrow = nRow, 
               ncol = nCol)
test[sample(nRow * nCol, nRow)] <- NA

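Then write the code that passes values down the way you want (it is applied to each column). Note that the horrible Excel '#N/A' has been replaced by NaN, which preserves the storage mode (i.e. numeric).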

innerF <- function(x){

  # Nothing to change
  if(!any(idx <- is.na(x) | x == 0))
    return(x)

  # Alter first value
  if(is.na(x[1]) | x[1] == 0)
    x[1] <- NaN

  # First value altered
  idx[1] <- FALSE

  # Pass down
  x[idx] <- x[which(idx) - 1]

  # Return
  x
}

Then define a calling strategy:

outerF <- function(x){

  # Run once
  y <- innerF(x)

  # Run till done
  while(any((is.na(y) & !is.nan(y)) | 
            (!is.na(y) & y == 0L))){
    y <- innerF(y)
  }

  # Return
  y
}

Test it against the alternative, and weep.... (Hint: use MESS and its C++):

library(microbenchmark)
library(MESS)

microbenchmark(apply(test, 2, outerF), times = 100)
#Unit: microseconds
#                   expr    min       lq     mean   median      uq     max neval
# apply(test, 2, outerF) 630.07 652.4505 806.4808 670.6965 686.234 3253.27   100

microbenchmark(apply(test, 2, filldown), times = 100)
#Unit: microseconds
#                 expr     min      lq     mean  median      uq      max neval
# apply(test, 2, filldown) 107.482 110.048 134.9092 112.329 114.895 1980.016   100

Answer 3

My RStudio would not let me install the MESS and zoo packages, so I had to find a solution similar to na.locf. The code is here in case anyone wants to use this approach:

start.time <- Sys.time()
for (j in 2:ncol(b))
{
  ColumnReversed <- rev(b[, j])  # so we fill from the bottom - oldest date first
  Index <- !is.na(as.numeric(ColumnReversed))  # TRUE for valid, FALSE for missing

  ValidVals <- c(NA, ColumnReversed[Index])  # [NA, final known, second final known, ..., first known]
  FilledIndex <- cumsum(Index) + 1  # [0,0,...,1,1,...,2,2,...,k,k] + 1
  # FilledIndex stores, for each row, the position in ValidVals of the price to use
  b[, j] <- rev(ValidVals[FilledIndex])  # store back in the original (newest-first) order
}
print(Sys.time() - start.time)  # elapsed time

The running time improved from 90 minutes to 65 seconds. Amazing!
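To see why the cumsum indexing works, here is a small trace of the same idea on a single toy column. The variable names are made up for illustration, and the column is assumed to be numeric with NA marking a missing price.

```r
# Toy column, newest price first; NA marks a missing price
col <- c(NA, 5, NA, NA, 3, NA)

reversed <- rev(col)                # oldest first: NA 3 NA NA 5 NA
valid    <- !is.na(reversed)        # FALSE TRUE FALSE FALSE TRUE FALSE
values   <- c(NA, reversed[valid])  # NA 3 5 (slot 1 means "nothing known yet")
position <- cumsum(valid) + 1       # 1 2 2 2 3 3
filled   <- rev(values[position])   # back to newest-first order
print(filled)
```

Each entry of `position` counts how many valid prices have been seen so far, so indexing `values` with it repeats the latest known price across every gap in a single vectorized step. The oldest entry stays NA because nothing precedes it.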

Making this simple loop more efficient in R?
