Question

I'm collecting data on how much my cats poop into a matrix:

m <- cbind(fluffy=c(1.1,1.2,1.3,1.4),misterCuddles=c(0.9,NA,1.1,1.0))
row.names(m) <- c("2013-01-01", "2013-01-02", "2013-01-03","2013-01-04")

This gives me:

           fluffy misterCuddles
2013-01-01    1.1           0.9
2013-01-02    1.2            NA
2013-01-03    1.3           1.1
2013-01-04    1.4           1.0

On each date, I'd like to know how many days in a row each cat has gone number 2. So the resulting matrix should look like this:

           fluffy misterCuddles
2013-01-01      1             1
2013-01-02      2             0
2013-01-03      3             1
2013-01-04      4             2

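Is there a way to do this efficiently? The cumsum function does something similar, but it is a primitive, so I can't modify it to suit my dirty, dirty needs.

I could run a for loop and keep a count like this: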

m.output <- matrix(nrow=nrow(m), ncol=ncol(m))
for (column in seq_len(ncol(m))) {
  run <- 0                               # current streak length for this cat
  for (row in seq_len(nrow(m))) {
    if (is.na(m[row,column])) run <- 0   # a missing day resets the streak
    else run <- run + 1

    m.output[row,column] <- run
  }
}

Is this the most efficient way of doing it? I have a lot of cats, and I've recorded years' worth of poop data. Can I parallelize this somehow?

Answer 1

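All of the answers here are actually too complicated (including my own earlier one, copied below). The Reduce family of answers just hides a for loop inside a single function call. I like Roland's and Ananda's, but I think both have a little too much going on.

So here is a simple vectorized solution: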

reset <- function(x) {
    s <- seq_along(x)
    s[!is.na(x)] <- 0
    seq_along(x) - cummax(s)
}

> apply(m, 2, reset)
     fluffy misterCuddles
[1,]      1             1
[2,]      2             0
[3,]      3             1
[4,]      4             2
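
To see why this works, it may help to trace the intermediate vectors for a single column. The following is only an illustrative sketch; the names idx, last_na and run are mine and do not appear in reset itself, and x simply mirrors the misterCuddles column from the question.

```r
# Trace of the cummax trick on one column.
x <- c(0.9, NA, 1.1, 1.0)

idx <- seq_along(x)      # 1 2 3 4
s <- idx
s[!is.na(x)] <- 0        # keep the position only where x is NA: 0 2 0 0
last_na <- cummax(s)     # position of the most recent NA so far: 0 2 2 2
run <- idx - last_na     # days since the most recent NA: 1 0 1 2

print(rbind(idx, s, last_na, run))
```

Every step is a whole-vector operation, which is why no explicit loop is needed.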

It also works on Roland's example:

m2 <- cbind(fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
           misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA))

> apply(m2, 2, reset)
     fluffy misterCuddles
[1,]      0             0
[2,]      1             1
[3,]      2             2
[4,]      3             0
[5,]      4             0
[6,]      5             1
[7,]      6             0

From earlier: this one isn't vectorized, but it also works:

pooprun <- function(x){
    z <- numeric(length=length(x))
    count <- 0
    for(i in seq_along(x)){
        if(is.na(x[i]))
            count <- 0          # NA resets the streak
        else
            count <- count + 1  # otherwise extend it
        z[i] <- count
    }
    return(z)
}

> apply(m, 2, pooprun)
     fluffy misterCuddles
[1,]      1             1
[2,]      2             0
[3,]      3             1
[4,]      4             2

Benchmarking

Here I've simply wrapped each person's answer in a function call (named after them).

> library(microbenchmark)
> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
         expr     min       lq   median       uq       max neval
     alexis()   1.540   4.6200   5.3890   6.1590   372.185  1000
     hadley()  87.755   92.758   94.298  96.6075  1767.012  1000
     thomas()  92.373  99.6860 102.7655 106.6140   315.223  1000
    matthew() 128.168 136.2505 139.7150 145.4880  5196.344  1000
 thomasloop() 133.556 141.6390 145.1030 150.4920 84131.427  1000
      usobi() 148.182 159.9210 164.7320 174.1620  5010.445  1000
     ananda() 720.507 742.4460 763.6140 801.3335  5858.733  1000

Here are the results for Roland's example data:

> microbenchmark(alexis(), hadley(), thomas(), matthew(), thomasloop(), usobi(), ananda(), times=1000)
Unit: microseconds
         expr     min       lq   median       uq      max neval
     alexis()   2.310   5.3890   6.1590   6.9290   75.438  1000
     hadley()  75.053   78.902   80.058   83.136 1747.767  1000
     thomas()  90.834  97.3770 100.2640 104.3050  358.329  1000
    matthew() 139.715 149.7210 154.3405 161.2680 5084.728  1000
 thomasloop() 144.718 155.4950 159.7280 167.4260 5182.103  1000
      usobi() 177.048 188.5945 194.3680 210.9180 5360.306  1000
     ananda() 705.881 729.9370 753.4150 778.8175 8226.936  1000

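Note: Alexis's and Hadley's solutions took quite a while to actually define as functions on my machine, whereas the others work out of the box, but Alexis's is otherwise the clear winner.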

Answer 2

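This should work. Note that each of your cats is independent of the others, so you can turn the data frame into a list and use mclapply, which processes the columns in parallel.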

count <- function(y,x){
  if(is.na(x)) return(0)
  return (y + 1)
}

oneCat = m[,1]

Reduce(count,oneCat,init=0,accumulate=TRUE)[-1]

Edit: here is the full answer

count <- function(x,y){
 if(is.na(y)) return(0)
 return (x + 1)
}

library(parallel)  # provides mclapply; on Windows mc.cores must stay at 1

mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)

EDIT2: The main remaining problem is that I get an extra 0 at the start, so...

result = mclapply(as.data.frame(m),Reduce,f=count,init=0,accumulate=TRUE)
finalResult = do.call('cbind',result)[-1,]
rownames(finalResult) = rownames(m)

does the job.
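
The extra leading 0 comes from Reduce itself: with accumulate=TRUE and an init value, the result starts with init and therefore has one more element than the input. A minimal single-column sketch, assuming a hypothetical vector x and a reducer named count_step that follows the same logic as count:

```r
# Running count that resets to 0 whenever the current value is NA.
count_step <- function(acc, value) {
  if (is.na(value)) return(0)
  acc + 1
}

x <- c(0.9, NA, 1.1, 1.0)   # hypothetical single column

with_init <- Reduce(count_step, x, init = 0, accumulate = TRUE)
print(length(with_init))    # one longer than x, because init comes first
print(with_init[-1])        # dropping the first element realigns with x
```

That is why the [-1] (or [-1,] after cbind) is needed before the row names are put back.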

Answer 3

Another option, similar to @Usobi's in that it uses Reduce, but with a slightly different approach:

apply(!is.na(m), 2, Reduce, f=function(x,y) if (y) x + y else y, accumulate=TRUE)
#      fluffy misterCuddles
# [1,]      1             1
# [2,]      2             0
# [3,]      3             1
# [4,]      4             2
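
As a rough illustration of what the reducer does on one column: working on the logical vector !is.na(x) means TRUE counts as 1 when added, and a FALSE is returned as-is, which resets the count to 0. The names present, step and runs below are made up for this sketch.

```r
x <- c(0.9, NA, 1.1, 1.0)   # hypothetical single column
present <- !is.na(x)        # TRUE FALSE TRUE TRUE

# acc is the running count, cur is TRUE/FALSE for the current day.
step <- function(acc, cur) if (cur) acc + cur else cur

# Without init, the first element of `present` seeds the count,
# so no extra leading element has to be dropped afterwards.
runs <- Reduce(step, present, accumulate = TRUE)
print(runs)
```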

Answer 4

I had saved a snippet from here that works almost exactly for a problem like this:

countReset <- function(x) {
  x[!is.na(x)] <- 1
  y <- ave(x, rev(cumsum(rev(is.na(x)))), FUN=cumsum)
  y[is.na(y)] <- 0
  y
}
apply(m, 2, countReset)
#            fluffy misterCuddles
# 2013-01-01      1             1
# 2013-01-02      2             0
# 2013-01-03      3             1
# 2013-01-04      4             2
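
The grouping expression is the least obvious part, so here is a step-by-step sketch on a single column. The variable names grp, ones and y are only for illustration.

```r
x <- c(0.9, NA, 1.1, 1.0)   # hypothetical single column

# Group id = number of NAs at or after each position,
# so every NA is the last element of its own group.
grp <- rev(cumsum(rev(is.na(x))))   # 1 1 0 0

ones <- x
ones[!is.na(ones)] <- 1             # 1 NA 1 1

# cumsum restarts in each group; the NA closing a group stays NA.
y <- ave(ones, grp, FUN = cumsum)   # 1 NA 1 2
y[is.na(y)] <- 0                    # 1 0 1 2
print(y)
```

Because each NA sits at the end of its group, it never contaminates the cumulative sum of the values before it.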

Answer 5

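Since I'm currently trying to get used to .Call, here's another idea that seems to work and is probably fast. (Don't take my word for it, though; my skills are not to be trusted!)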

library(inline)  #use "inline" package for convenience

f <- cfunction(sig = c(R_mat = "numeric", R_dims = "integer"), body = '
 R_len_t *dims = INTEGER(R_dims);
 R_len_t rows = dims[0], cols = dims[1];
 double *mat = REAL(R_mat);

 SEXP ans;
 PROTECT(ans = allocMatrix(INTSXP, rows, cols));
 R_len_t *pans = INTEGER(ans);

 for(int ic = 0; ic < cols; ic++)
  {
   pans[0 + ic*rows] = ISNA(mat[0 + ic*rows]) ? 0 : 1;

   for(int ir = 1; ir < rows; ir++)
    {
     if(ISNA(mat[ir + ic*rows]))
      {
       pans[ir + ic*rows] = 0;
      }else
      {
       if(!ISNA(mat[(ir - 1) + ic*rows]))
        {
         pans[ir + ic*rows] = pans[(ir - 1) + ic*rows] + 1;
        }else
        {
         pans[ir + ic*rows] = 1;
        }
      }
    }
  }

 UNPROTECT(1);

 return(ans);
')

f(m, dim(m))
#     [,1] [,2]
#[1,]    1    1
#[2,]    2    0
#[3,]    3    1
#[4,]    4    2
f(mm, dim(mm))   #I named Roland's matrix, mm ; I felt that I had to pass this test!
#     [,1] [,2]
#[1,]    0    0
#[2,]    1    1
#[3,]    2    2
#[4,]    3    0
#[5,]    4    0
#[6,]    5    1
#[7,]    6    0

Answer 6

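So the solution to this problem has two parts:

1. A function that takes one vector per cat and returns a vector telling me, for each date, the number of days since the last NA
2. A function that takes an NxM matrix and returns an NxM matrix, applying function (1) to each column

For (2), I adapted this from @Usobi's answer: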

library(parallel)  # provides mclapply

daysSinceLastNA <- function(matrix, vectorFunction, cores=1) {
  # one list element per cat (column), processed on up to `cores` cores
  listResult <- mclapply(as.data.frame(matrix), vectorFunction, mc.cores=cores)
  result <- do.call('cbind', listResult)
  rownames(result) <- rownames(matrix)
  result
}

For (1), I have two solutions:

@ananda-mahto's solution:

daysSinceLastNA_1 <- function(vector) {
  vector[!is.na(vector)] <- 1
  result <- ave(vector, rev(cumsum(rev(is.na(vector)))), FUN=cumsum)
  result[is.na(result)] <- 0
  result
}

@Usobi's solution:

daysSinceLastNA_2 <- function(vector) {
  reduction <- function(total, additional) ifelse(is.na(additional), 0, total + 1)
  Reduce(reduction, vector, init=0, accumulate=TRUE)[-1]
}

Then I call them like this:

> system.time(result1 <- daysSinceLastNA (test, daysSinceLastNA_1 ))
   user  system elapsed 
   5.40    0.01    5.42 
> system.time(result2 <- daysSinceLastNA (test, daysSinceLastNA_2 ))
   user  system elapsed 
  58.02    0.00   58.03

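On my test data set (roughly a 2500x2500 matrix), the first approach is an order of magnitude faster.

If I use 64 cores on Linux, solution (1) runs in 2 seconds and solution (2) runs in 6 seconds.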
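
For anyone who wants to try this without my data, here is a small sketch that builds a stand-in matrix and checks that the two approaches agree. The matrix test_small, its size, the seed and the share of missing values are all made up for illustration; by_ave and by_reduce just restate the two vector functions above, with a plain if in place of ifelse.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for `test`: 200 days x 50 cats, 1000 missing entries.
n_days <- 200
n_cats <- 50
test_small <- matrix(runif(n_days * n_cats), nrow = n_days,
                     dimnames = list(NULL, paste0("cat", seq_len(n_cats))))
test_small[sample(length(test_small), 1000)] <- NA

by_ave <- function(v) {
  v[!is.na(v)] <- 1
  out <- ave(v, rev(cumsum(rev(is.na(v)))), FUN = cumsum)
  out[is.na(out)] <- 0
  out
}

by_reduce <- function(v) {
  step <- function(total, value) if (is.na(value)) 0 else total + 1
  Reduce(step, v, init = 0, accumulate = TRUE)[-1]
}

res_ave <- apply(test_small, 2, by_ave)
res_reduce <- apply(test_small, 2, by_reduce)
stopifnot(all(res_ave == res_reduce))  # both approaches give the same counts

# Timings depend on the machine, so none are quoted here.
print(system.time(apply(test_small, 2, by_ave)))
print(system.time(apply(test_small, 2, by_reduce)))
```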

Answer 7

For this sort of problem, which is easily solved with a for loop, I find Rcpp a very natural answer.

library(Rcpp)

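cppFunction("NumericVector cumsum2(NumericVector x) {
  int n = x.length();
  NumericVector out(n);  // fresh output; out(x) would share memory with x

  for(int i = 0; i < n; ++i) {
    if (NumericVector::is_na(x[i])) {
      out[i] = 0;                                // NA resets the count
    } else {
      out[i] = (i == 0) ? 1.0 : out[i - 1] + 1;  // start at 1 or extend the run
    }
  }

  return out;
}")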

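The code needs a bit more bookkeeping than the equivalent R code, but the bulk of the function is a very simple for loop.

You can then apply it in R like any other vectorised function: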

m2 <- cbind(
  fluffy=c(NA,1.1,1.2,1.3,1.4,1.0,2),
  misterCuddles=c(NA,1.3,2,NA,NA,1.1,NA)
)

apply(m2, 2, cumsum2)

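You could of course make the C++ code iterate over the columns of the matrix, but since that is already easy to express in R, I think you might as well use the built-in tools.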
