Question

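I'm new to R and have run into a problem I can't solve on my own.

A friend suggested I use one of the apply functions, but I don't know how to use one in this case. Anyway, on to the problem! =)

Inside the inner while loop I have an ifelse, and that is the bottleneck. Each iteration takes about 1 second on average. The slow part is marked in the code with #slow part start / end.

Since it runs 2000 * 100 = 200000 times, one run of this code takes about 55.5 hours to complete. The bigger problem is that the code will be rerun repeatedly, so x * 55.5 hours is not feasible.

Below is a small part of the code that is relevant to the problem: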

    #dt is data.table with close to 1.5million observations of 11 variables
    #rand.mat is a 110*100 integer matrix

    j <- 1
    while(j <= 2000)
    {  
            #other code is executed here, not relevant to the question

            i <- 1
            while(i <= 100)
            {
                    #slow part start
                    dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)
                    #slow part end

                    i <- i + 1
            }

            #other code is executed here, not relevant to the question

            j <- j + 1
    }

Any suggestions would be greatly appreciated.

Edit - run the following code to reproduce the problem:

library(data.table)

dt = data.table(datecolumn=c("20121101", "20121101", "20121104", "20121104", "20121130", "20121130",
                             "20121101", "20121101", "20121104", "20121104", "20121130", "20121130"),
                column2=c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
                column3=c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
                column4=c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2"))


unq_date <- c(20121101L, 
20121102L, 20121103L, 20121104L, 20121105L, 20121106L, 20121107L, 
20121108L, 20121109L, 20121110L, 20121111L, 20121112L, 20121113L, 
20121114L, 20121115L, 20121116L, 20121117L, 20121118L, 20121119L, 
20121120L, 20121121L, 20121122L, 20121123L, 20121124L, 20121125L, 
20121126L, 20121127L, 20121128L, 20121129L, 20121130L
)

index <- as.numeric(dt$column4)
numberOfRepititions <- 2
set.seed(131107)

rand.mat <- replicate(numberOfRepititions, sample(unq_date, numberOfRepititions))
i <- 1
while(i <= numberOfRepititions)
{       
    dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)      
    i <- i + 1
}

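Note that for now the loop cannot be run more than 2 times unless dt grows in rows, so that we have the original 100 types of column4 (which is just an integer value from 1 to 100).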

Answer 1

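Here is a proposal based on your small example dataset. I tried to vectorize the operations. As in your example, numberOfRepititions is the number of loop iterations.

First, create the matrices for all the necessary evaluations. dt$datecolumn is compared with every column of rand.mat: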

rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)

Here, dt$column4 is compared with every value of the vector index:

imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)

The two matrices are combined with a logical AND. Then we check, row by row, whether there is at least one TRUE:

replace_idx <- rowSums(rmat & imat) != 0
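To make the combination step concrete, here is a minimal sketch on toy matrices. The names rmat_toy, imat_toy, both and hit, and their values, are made up for illustration and are not taken from dt. Each column stands for one loop iteration and each row for one observation.

```r
# Toy logical matrices (hypothetical values); matrix() fills column by column
rmat_toy <- matrix(c(TRUE, FALSE, TRUE,
                     FALSE, FALSE, TRUE), ncol = 2)
imat_toy <- matrix(c(TRUE, TRUE, FALSE,
                     FALSE, TRUE, TRUE), ncol = 2)

# Element-wise AND: TRUE only where both conditions hold in the same iteration
both <- rmat_toy & imat_toy

# A row is flagged when at least one iteration matched
hit <- rowSums(both) != 0
print(hit)
```

Rows 1 and 3 are flagged, because each has one column where both matrices are TRUE; row 2 has none.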

Use the resulting index to replace the corresponding values with NA:

is.na(dt$column2) <- replace_idx

Done.

The code in one block:

rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
replace_idx <- rowSums(rmat & imat) != 0
is.na(dt$column2) <- replace_idx
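One way to gain confidence in the vectorized version is to compare it with the original loop on the sample data. The sketch below makes a few assumptions for the sake of being self-contained: it uses a base data.frame instead of a data.table, a fixed rand.mat instead of sample() so the result does not depend on the random number generator, and the illustrative names n_rep, loop_result and vec_result.

```r
# Sample data from the question, held in a base data.frame (assumption: no package needed)
dt <- data.frame(
  datecolumn = rep(c("20121101", "20121101", "20121104",
                     "20121104", "20121130", "20121130"), 2),
  column2 = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
  column4 = rep(c("1", "2"), each = 6),
  stringsAsFactors = FALSE
)
index <- as.numeric(dt$column4)
n_rep <- 2  # plays the role of numberOfRepititions

# Fixed dates (hypothetical) instead of sample(): one column per iteration
rand.mat <- matrix(c(20121101L, 20121115L, 20121104L, 20121120L), nrow = 2)

# Loop version, as in the question
loop_result <- dt$column2
for (i in seq_len(n_rep)) {
  loop_result <- ifelse(dt$datecolumn %in% rand.mat[, i] & dt$column4 == index[i],
                        NA, loop_result)
}

# Vectorized version, as in this answer
rmat <- apply(rand.mat[, seq(n_rep)], 2, "%in%", x = dt$datecolumn)
imat <- sapply(head(index, n_rep), "==", dt$column4)
replace_idx <- rowSums(rmat & imat) != 0
vec_result <- dt$column2
is.na(vec_result) <- replace_idx

# Both approaches should blank out the same rows
print(identical(loop_result, vec_result))
```

With this fixed rand.mat, both versions set rows 1 to 4 to NA and leave the rest untouched.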

Answer 2

I think you can do it in one line, like this:

dt[which(apply(dt, 1, function(x) x[1] %in% rand.mat[,as.numeric(x[4])])),]$column3<-NA

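Basically, the apply call works as follows:

1) It uses the data in "dt".

2) The "1" means the function is applied row by row.

3) The function receives each row as "x" and returns TRUE if the row meets the condition.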
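To see what apply hands to the function, here is a small sketch on a hypothetical two-row table. The names toy, lookup, row_types and row_hits are made up for illustration, and the columns are addressed by name rather than by position. Because apply converts the table to a matrix first, each row arrives as a character vector here, which is why as.numeric is needed before the value can be used as a column index.

```r
# Hypothetical two-row table, only to show what apply() passes to the function
toy <- data.frame(datecolumn = c("20121101", "20121104"),
                  column4 = c("1", "2"),
                  stringsAsFactors = FALSE)

# Hypothetical lookup matrix: column k holds the dates to match for column4 == k
lookup <- matrix(c(20121101L, 20121102L,
                   20121103L, 20121105L), nrow = 2)

# Each row is passed as a character vector, even for number-like columns
row_types <- apply(toy, 1, class)

row_hits <- apply(toy, 1, function(x) {
  # Pick the lookup column for this row's column4 and test the row's date against it
  x["datecolumn"] %in% lookup[, as.numeric(x["column4"])]
})
print(row_hits)  # TRUE for the first row, FALSE for the second
```

Wrapping the result in which(), as in the one-liner, turns the logical vector into the row numbers to overwrite.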

R - extremely slow code
