Question

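I am trying to write a function in R that uses iterators from the iterators package (https://cran.r-project.org/web/packages/iterators/iterators.pdf) to iterate over each row of a data frame.
Given a table like this: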

        data <- matrix(c(1,0,0,NA,NA,1,1,NA,0), ncol=3, byrow=TRUE)
        > data
             [,1] [,2] [,3]
        [1,]    1    0    0
        [2,]   NA   NA    1
        [3,]    1   NA    0

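I want it to go through each row and return the first non-NA value from left to right, or NA if every value in the row is NA. So for the data above, it should return 1, 1, 1.
My rough idea so far is to use the package's iter() function like this: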

vec<-vector()
iterRow<-iter(data[x,]) #Creates iterator object for row x of data
i<-1
while(i<iterRow$length){ #iterRow$length gives # of columns essentially
     temp<-nextElem(iterRow) #Set temp to the next element of the iterator
     if(!is.na(temp)){ #If the value is not NA, set value in vec to the value
         vec<-c(vec, temp)
     }
     i<-i+1
}
vec<-c(vec, NA) #Otherwise set it to NA
 return(vec)

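The data I am working with will be several million rows long, so ideally I would like to vectorize the function. I am stuck on how to apply this idea across the whole data frame.
Would it be feasible to write a function like this: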

iterateRows<- function(dataFrame){
...
}

It would take the data frame I am working with as its argument.

I also know C++, so if it would be easier to write a function like this in C++, I can do that too. Any help would be greatly appreciated!

Answer 1

Start with a simple approach. Here is a function that does what you want for a single row:

first_not_na = function(x) if(all(is.na(x))) NA else x[!is.na(x)][1]

Here are a couple of simple ways to apply it to every row of the data.

# apply
results = apply(data, 1, first_not_na)

# for loop
results = numeric(nrow(data))
for (i in seq_len(nrow(data))) results[i] = first_not_na(data[i, ])
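
As a quick sanity check, the sketch below runs both approaches on the 3x3 example from the question, with one extra all-NA row appended to exercise the NA case. It assumes the example was meant to be a numeric matrix; the names `small`, `res_apply`, and `res_loop` are illustrative and not from the original post.

```r
# Row-wise helper from the answer above
first_not_na = function(x) if (all(is.na(x))) NA else x[!is.na(x)][1]

# Example from the question, plus a hypothetical all-NA fourth row
small = matrix(c(1, 0, 0,
                 NA, NA, 1,
                 1, NA, 0,
                 NA, NA, NA), ncol = 3, byrow = TRUE)

# apply over rows (MARGIN = 1)
res_apply = apply(small, 1, first_not_na)

# equivalent for loop with a preallocated result vector
res_loop = numeric(nrow(small))
for (i in seq_len(nrow(small))) res_loop[i] = first_not_na(small[i, ])

print(res_apply)  # values: 1, 1, 1, NA
```

Both results should agree, with NA only for the row that has no non-missing value.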

Here is a benchmark comparing timings on larger data:

row = 1e6
col = 5

data = matrix(sample(c(1, 0, NA), size = row * col, replace = TRUE), nrow = row)

microbenchmark::microbenchmark(
  apply = {results = apply(data, 1, first_not_na)},
  loop = {
    results = numeric(row)
    for (i in 1:row) results[i] = first_not_na(data[i, ])
  },
  times = 5
)
# Unit: seconds
#   expr      min       lq     mean   median       uq      max neval cld
#  apply 2.140379 2.249405 2.399239 2.480180 2.524667 2.601563     5   a
#   loop 1.970481 1.982853 2.160342 2.090484 2.264797 2.493095     5   a

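A simple for loop takes about 2 seconds to process 1M rows and 5 columns. If you need more speed, you can parallelize with foreach. Only if that is still not fast enough would you need to look for more complicated solutions, such as iterators or a C++ implementation.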
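
Since the question asks about vectorizing, here is one possible sketch that loops over columns instead of rows, so the number of R-level iterations is the column count (5 in the benchmark) rather than the row count. This is not the approach benchmarked above and it has not been timed here; the function name `first_not_na_vec` is hypothetical, and the code assumes the input has at least one column and can be coerced to a matrix.

```r
# Column-wise "coalesce": start from the first column and fill the
# still-missing rows from each later column, left to right.
# Assumption: m has at least one column.
first_not_na_vec = function(m) {
  m = as.matrix(m)
  result = m[, 1]
  for (j in seq_len(ncol(m))[-1]) {
    missing = is.na(result)            # rows with no value found yet
    result[missing] = m[missing, j]    # take column j for those rows
  }
  result                               # rows that are all NA stay NA
}

# Compare against the row-wise helper on small random data
first_not_na = function(x) if (all(is.na(x))) NA else x[!is.na(x)][1]
set.seed(1)
m = matrix(sample(c(1, 0, NA), size = 50 * 4, replace = TRUE), nrow = 50)
vec_result = first_not_na_vec(m)
row_result = apply(m, 1, first_not_na)
```

Each pass works on whole columns at once, which is why this kind of approach is usually worth trying before reaching for iterators or C++.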
