R programming: how to speed up a loop that takes 2 hours, and why it takes so long

Asked: 2016-03-07 22:18:29

Tags: r performance for-loop data.table

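I am running a for loop to fill a vector, but the loop takes 2 hours. I don't know whether that is because I am doing something inefficient or simply because loops in R are slow. I have to use a loop for this part because each step needs the previous value, so I can't vectorize the operation.

I am using the data.table package. My laptop has 8 GB of RAM and an Intel Core i5 pro 2.3 GHz processor, and I am running 64-bit R version 3.2.3.

The table has the following structure (sorted in ascending order by NUMDCRED and FDES):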

NUMDCRED         FDES       Flag_Entrada_Mora  Flag_Salida_Mora   
 0001        "2012-01-01"         0                   0
 0001        "2012-03-01"         1                   0
 0001        "2012-04-01"         0                   0
 0002        "2011-01-01"         0                   0
 0002        "2011-02-01"         0                   0
 0002        "2011-03-01"         0                   0
 0003        "2012-05-01"         0                   0
 0003        "2012-06-01"         0                   1
 0003        "2012-07-01"         0                   0

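The code uses the variables FDES, Flag_Entrada_Mora and Flag_Salida_Mora to create two new variables, Ult_Entrada_Mora and Ult_Salida_Mora. Ult_Entrada_Mora records the last date on which a NUMDCRED entered mora, and Ult_Salida_Mora records the last date on which it exited mora. On the first row of each NUMDCRED (that is, the first date on which that NUMDCRED appears), Ult_Entrada_Mora must take the FDES value, and that date must be repeated until it is updated, which happens each time Flag_Entrada_Mora is 1. Ult_Salida_Mora must be NA on the first row of each NUMDCRED and stay NA until Flag_Salida_Mora updates it; that value is then repeated until the next update, and so on.

In my code, First_NumdCred_index gives me the rows where a new NUMDCRED appears, and I use %in% to check whether i is one of those indices. aux_entrada and aux_salida are updated only when one of the events described above occurs.

The output table for the example above is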

NUMDCRED         FDES       Flag_Entrada_Mora  Flag_Salida_Mora  Ult_Entrada_Mora  Ult_Salida_Mora
 0001        "2012-01-01"         0                   0            "2012-01-01"          NA
 0001        "2012-03-01"         1                   0            "2012-03-01"          NA
 0001        "2012-04-01"         0                   0            "2012-03-01"          NA
 0002        "2011-01-01"         0                   0            "2011-01-01"          NA
 0002        "2011-02-01"         0                   0            "2011-01-01"          NA
 0002        "2011-03-01"         0                   0            "2011-01-01"          NA
 0003        "2012-05-01"         0                   0            "2012-05-01"          NA
 0003        "2012-06-01"         0                   1            "2012-05-01"      "2012-06-01"
 0003        "2012-07-01"         0                   0            "2012-05-01"      "2012-06-01"

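Below is the code I use to run the loop (n2 = 648,385).

First_NumdCred_index is a vector holding a series of row indices of the table. Its length is 148,982 and its class is numeric. FDES is an IDate; Flag_Entrada_Mora and Flag_Salida_Mora are numeric.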

n2 <- length(Poblacion_Morosa3$NUMDCRED)
Ult_Entrada_Mora <- seq(as.IDate("2020-01-01"),by = "month",length.out = n2)
#vector(mode = "character",length=n2)
Ult_Salida_Mora <- seq(as.IDate("2020-01-01"),by = "month",length.out = n2)

aux_entrada <- as.IDate("2005-01-01")
aux_salida <- as.IDate("2005-01-01")

for(i in 1:n2){ 

 if(i %in% First_NumdCred_index){

    aux_entrada <- Poblacion_Morosa3[i,FDES]
    aux_salida <- NA
   } else if(Poblacion_Morosa3[i,Flag_Entrada_Mora] == 1){

     aux_entrada <- Poblacion_Morosa3[i,FDES]
   } else if(Poblacion_Morosa3[i,Flag_Salida_Mora] == 1){

    aux_salida <- Poblacion_Morosa3[i,FDES]
   }

  Ult_Entrada_Mora[i] <- aux_entrada
  Ult_Salida_Mora[i] <- aux_salida
}

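I would like to know whether it is normal for this to take 2 hours to run, or whether I am doing something inefficiently.

3 Answers:

Answer 0: (score: 2)

Here is something I put together, though I'm not sure it is what you are trying to do: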

library(data.table)
set.seed(123)
ex <- data.table(FDES=sample(seq(as.IDate("2001-01-01"),by="month",length=100),
                             1000,replace=T),
                 flag_entrance=sample(c(0,1),1000,replace=T),
                 flag_exit=sample(c(0,1),1000,replace=T))
First_NumCred_index <- sample(1:nrow(ex),250,replace=F)

> ex
            FDES flag_entrance flag_exit
   1: 2003-05-01             0         0
   2: 2007-07-01             1         0
   3: 2004-05-01             0         0
   4: 2008-05-01             1         1
   5: 2008-11-01             1         0
  ---                                   
 996: 2007-11-01             1         0
 997: 2006-05-01             0         1
 998: 2004-04-01             1         0
 999: 2006-11-01             0         0
1000: 2001-11-01             0         1

Now we can handle this in a few passes. You could probably make it a bit faster still, but this seems fast enough...

ex[,`:=`(date.seq.1=as.IDate(NA_integer_,origin="1970-01-01"),
         date.seq.2=as.IDate(NA_integer_,origin="1970-01-01"))]
ex[First_NumCred_index,date.seq.1:=FDES]
ex[flag_entrance==1,date.seq.1:=FDES]
ex[flag_exit==1,date.seq.2:=FDES] 

> ex
            FDES flag_entrance flag_exit date.seq.1 date.seq.2
   1: 2003-05-01             0         0       <NA>       <NA>
   2: 2007-07-01             1         0 2007-07-01       <NA>
   3: 2004-05-01             0         0 2004-05-01       <NA>
   4: 2008-05-01             1         1 2008-05-01 2008-05-01
   5: 2008-11-01             1         0 2008-11-01       <NA>
  ---                                                         
 996: 2007-11-01             1         0 2007-11-01       <NA>
 997: 2006-05-01             0         1       <NA> 2006-05-01
 998: 2004-04-01             1         0 2004-04-01       <NA>
 999: 2006-11-01             0         0       <NA>       <NA>
1000: 2001-11-01             0         1       <NA> 2001-11-01

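So you keep your date sequences with the NAs, which you (apparently?) want, and you can recover them as vectors with ex[,date.seq.1] and so on.

I'm guessing I haven't fully understood your question. In particular, you say that you sometimes need to refer to the previous row's value. If that is the case, you can combine the suggestion above with a call to shift. For example, if you need to "take the previous row's value when a condition holds, and the current row's value otherwise," you could do something like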

ex[,date.seq.3:=ifelse( condition, shift(FDES), FDES)]
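One caveat with the line above: base ifelse() strips attributes, so the result would come back as bare integers rather than IDate values. The sketch below shows the same idea with data.table's fifelse(), which keeps the class of its inputs (it assumes data.table 1.12.4 or later). The table `toy` and the logical column `use_prev` are hypothetical stand-ins for the real data and for the unspecified `condition`.

```r
library(data.table)

# Hypothetical toy data; `use_prev` stands in for the unspecified `condition`.
toy <- data.table(
  FDES     = as.IDate(c("2012-01-01", "2012-03-01", "2012-04-01")),
  use_prev = c(FALSE, TRUE, FALSE)
)

# shift() lags FDES by one row (the first row gets NA).
# fifelse() keeps the IDate class, whereas base ifelse() would drop it.
toy[, date.seq.3 := fifelse(use_prev, shift(FDES), FDES)]
print(toy)
```

Row 2 takes the date from row 1 because `use_prev` is TRUE there; the other rows keep their own FDES.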

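Best.

Edit to expand on my comment. If all you want is to "keep repeating the last date until you see a 1, then switch to the subsequent date," then you can try something like this: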

> ex[,.(FDES,flag_entrance,FDES[cumsum(rle(flag_entrance)$values)])]
            FDES flag_entrance         V3
   1: 2003-05-01             0 2003-05-01
   2: 2007-07-01             1 2003-05-01
   3: 2004-05-01             0 2007-07-01
   4: 2008-05-01             1 2007-07-01
   5: 2008-11-01             1 2004-05-01
  ---  

If you copy this vector into the data.table rather than just grabbing the vector, be careful about recycling.
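For comparison, the "repeat the last date until a flag of 1 appears" rule can also be sketched with cummax() inside each NUMDCRED group. This is a different technique from the rle() call above, written against the column names and the 9-row sample from the question; the table name `DT` is made up for the illustration, and I have not timed it on the full 648,385 rows. Note one assumption: a Flag_Salida_Mora of 1 on the first row of a NUMDCRED sets the exit date here, whereas the original loop would leave it NA.

```r
library(data.table)

# The 9-row sample from the question (`DT` is a hypothetical name).
DT <- data.table(
  NUMDCRED = rep(c("0001", "0002", "0003"), each = 3),
  FDES = as.IDate(c("2012-01-01", "2012-03-01", "2012-04-01",
                    "2011-01-01", "2011-02-01", "2011-03-01",
                    "2012-05-01", "2012-06-01", "2012-07-01")),
  Flag_Entrada_Mora = c(0, 1, 0, 0, 0, 0, 0, 0, 0),
  Flag_Salida_Mora  = c(0, 0, 0, 0, 0, 0, 0, 1, 0)
)

# Entrada: marker rows are the first row of the group or rows with flag 1.
# cummax() over the marked positions gives the latest marker seen so far.
DT[, Ult_Entrada_Mora := {
  pos  <- seq_len(.N)
  mark <- pos == 1L | Flag_Entrada_Mora == 1
  FDES[cummax(ifelse(mark, pos, 0L))]
}, by = NUMDCRED]

# Salida: position 0 means "no exit seen yet in this group", mapped to NA.
DT[, Ult_Salida_Mora := {
  pos  <- seq_len(.N)
  last <- cummax(ifelse(Flag_Salida_Mora == 1, pos, 0L))
  FDES[ifelse(last == 0L, NA_integer_, last)]
}, by = NUMDCRED]

print(DT)
```

Grouping with by = NUMDCRED restarts the running maximum for every credit number, so no separate index of first rows is needed.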

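Answer 1: (score: 1)

I suspect the %in% operation inside the loop takes up most of the time. You can remove it by precomputing its result before the loop, like this: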

isFirstNumdCred <- (1:n2) %in% First_NumdCred_index
for(i in 1:n2){ 
   if(isFirstNumdCred[i]){
      aux_entrada <- Poblacion_Morosa3[i,FDES]
      aux_salida <- NA
   } else if(Poblacion_Morosa3[i,Flag_Entrada_Mora] == 1){
      aux_entrada <- Poblacion_Morosa3[i,FDES]
   } else if(Poblacion_Morosa3[i,Flag_Salida_Mora] == 1){
      aux_salida <- Poblacion_Morosa3[i,FDES]
   }

   Ult_Entrada_Mora[i] <- aux_entrada
   Ult_Salida_Mora[i] <- aux_salida
}
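A further step in the same direction, offered as an untimed sketch rather than a measured fix: every `Poblacion_Morosa3[i,FDES]` call goes through data.table's subsetting machinery, which likely adds overhead on each of the 648,385 iterations. Pulling each column out once into a plain vector keeps the loop logic unchanged while indexing only atomic vectors. The short names (`fdes`, `entrada`, `salida`, `ult_entrada`, `ult_salida`) are made up for this illustration, and the data is the 9-row sample from the question.

```r
library(data.table)

Poblacion_Morosa3 <- data.table(
  NUMDCRED = rep(c("0001", "0002", "0003"), each = 3),
  FDES = as.IDate(c("2012-01-01", "2012-03-01", "2012-04-01",
                    "2011-01-01", "2011-02-01", "2011-03-01",
                    "2012-05-01", "2012-06-01", "2012-07-01")),
  Flag_Entrada_Mora = c(0, 1, 0, 0, 0, 0, 0, 0, 0),
  Flag_Salida_Mora  = c(0, 0, 0, 0, 0, 0, 0, 1, 0)
)
First_NumdCred_index <- c(1, 4, 7)

n2 <- nrow(Poblacion_Morosa3)

# Extract each column once as a plain vector (dates as integer day counts).
fdes    <- as.integer(Poblacion_Morosa3$FDES)
entrada <- Poblacion_Morosa3$Flag_Entrada_Mora == 1
salida  <- Poblacion_Morosa3$Flag_Salida_Mora == 1
isFirst <- seq_len(n2) %in% First_NumdCred_index

ult_entrada <- integer(n2)
ult_salida  <- integer(n2)
aux_entrada <- NA_integer_
aux_salida  <- NA_integer_

# Same branching as the original loop, but only on atomic vectors.
for (i in seq_len(n2)) {
  if (isFirst[i]) {
    aux_entrada <- fdes[i]
    aux_salida  <- NA_integer_
  } else if (entrada[i]) {
    aux_entrada <- fdes[i]
  } else if (salida[i]) {
    aux_salida <- fdes[i]
  }
  ult_entrada[i] <- aux_entrada
  ult_salida[i]  <- aux_salida
}

# Convert the day counts back to IDate once, after the loop.
Ult_Entrada_Mora <- as.IDate(ult_entrada, origin = "1970-01-01")
Ult_Salida_Mora  <- as.IDate(ult_salida, origin = "1970-01-01")
```

Working on integer day counts inside the loop also avoids assigning a logical NA into a date vector; the class is restored in a single conversion at the end.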

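Answer 2: (score: 1)

It seems to me that findInterval() is the most appropriate function for this problem. Your intermediate variables essentially keep their previous value, except at known marker rows in the sequence, where they change to a known value that is either fixed (NA) or looked up in the input frame (the FDES column). We can use findInterval() to find the nearest preceding marker according to the required logic, and then index a vector of target values with the winning marker's index.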
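To make the "nearest preceding marker" idea concrete, here is a minimal base R sketch on made-up marker positions (`marks`, `rows` and `marks2` are hypothetical names, not part of the solution code).

```r
# Hypothetical marker rows: rows 1, 4 and 7 start a new NUMDCRED.
marks <- c(1L, 4L, 7L)
rows  <- 1:9

# findInterval() returns, for each row, how many markers are <= that row;
# indexing `marks` with that count yields the nearest marker at or before the row.
pos  <- findInterval(rows, marks)
last <- marks[pos]
print(last)

# When rows can precede the first marker the count is 0, so a leading 0L
# is prepended and the index shifted by one; 0 then means "no marker yet".
marks2 <- c(3L, 6L)
last2  <- c(0L, marks2)[findInterval(1:7, marks2) + 1L]
print(last2)
```

The second form is the one to use for exit flags, where the first rows of the table may come before any flagged row.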

## libs
library(data.table);

## generate test data
set.seed(4L);
n2 <- 648385L;
Poblacion_Morosa3 <- data.table(
    NUMDCRED=sprintf('%04d',cumsum(c(T,sample(c(rep(F,3L),T),n2-1L,replace=T)))), ## avg 4 rows per num
    FDES=seq(as.IDate('2011-01-01'),by=1,len=n2),
    Flag_Entrada_Mora=sample(c(rep(0L,5L),1L),n2,replace=T), ## avg 6 rows per flag
    Flag_Salida_Mora=sample(c(rep(0L,5L),1L),n2,replace=T) ## ditto
);

## solution
system.time({
    findLastIndex <- function(iall,imark) c(0L,imark)[findInterval(iall,imark)+1L];
    n2 <- nrow(Poblacion_Morosa3);
    row.seq <- seq_len(n2);
    num.start <- c(T,Poblacion_Morosa3[,NUMDCRED[-.N]!=NUMDCRED[-1L]]);
    entrada.fdes <- findLastIndex(row.seq,which(num.start | Poblacion_Morosa3[,Flag_Entrada_Mora==1]));
    Ult_Entrada_Mora <- Poblacion_Morosa3[entrada.fdes,FDES];
    salida.na <- findLastIndex(row.seq,which(num.start));
    salida.fdes <- findLastIndex(row.seq,which(Poblacion_Morosa3[,Flag_Salida_Mora==1]));
    Ult_Salida_Mora <- c(as.IDate(NA),Poblacion_Morosa3[,FDES])[ifelse(salida.fdes>=salida.na,salida.fdes+1L,1L)];
});
##   user  system elapsed
##  0.328   0.047   0.374
## show result
head(cbind(Poblacion_Morosa3,Ult_Entrada_Mora,Ult_Salida_Mora),50L);
##     NUMDCRED       FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora
##  1:     0001 2011-01-01                 0                0       2011-01-01            <NA>
##  2:     0001 2011-01-02                 0                0       2011-01-01            <NA>
##  3:     0001 2011-01-03                 1                0       2011-01-03            <NA>
##  4:     0001 2011-01-04                 0                0       2011-01-03            <NA>
##  5:     0001 2011-01-05                 0                0       2011-01-03            <NA>
##  6:     0002 2011-01-06                 0                0       2011-01-06            <NA>
##  7:     0002 2011-01-07                 0                0       2011-01-06            <NA>
##  8:     0002 2011-01-08                 0                0       2011-01-06            <NA>
##  9:     0003 2011-01-09                 1                0       2011-01-09            <NA>
## 10:     0004 2011-01-10                 1                0       2011-01-10            <NA>
## 11:     0004 2011-01-11                 0                0       2011-01-10            <NA>
## 12:     0005 2011-01-12                 0                0       2011-01-12            <NA>
## 13:     0005 2011-01-13                 1                0       2011-01-13            <NA>
## 14:     0005 2011-01-14                 0                0       2011-01-13            <NA>
## 15:     0006 2011-01-15                 0                1       2011-01-15      2011-01-15
## 16:     0006 2011-01-16                 0                0       2011-01-15      2011-01-15
## 17:     0006 2011-01-17                 0                1       2011-01-15      2011-01-17
## 18:     0007 2011-01-18                 1                0       2011-01-18            <NA>
## 19:     0007 2011-01-19                 0                0       2011-01-18            <NA>
## 20:     0008 2011-01-20                 0                0       2011-01-20            <NA>
## 21:     0009 2011-01-21                 0                0       2011-01-21            <NA>
## 22:     0009 2011-01-22                 1                0       2011-01-22            <NA>
## 23:     0010 2011-01-23                 0                1       2011-01-23      2011-01-23
## 24:     0010 2011-01-24                 0                1       2011-01-23      2011-01-24
## 25:     0010 2011-01-25                 1                0       2011-01-25      2011-01-24
## 26:     0010 2011-01-26                 0                0       2011-01-25      2011-01-24
## 27:     0011 2011-01-27                 0                0       2011-01-27            <NA>
## 28:     0011 2011-01-28                 0                0       2011-01-27            <NA>
## 29:     0012 2011-01-29                 0                1       2011-01-29      2011-01-29
## 30:     0012 2011-01-30                 0                0       2011-01-29      2011-01-29
## 31:     0012 2011-01-31                 1                0       2011-01-31      2011-01-29
## 32:     0012 2011-02-01                 0                0       2011-01-31      2011-01-29
## 33:     0012 2011-02-02                 0                0       2011-01-31      2011-01-29
## 34:     0013 2011-02-03                 0                0       2011-02-03            <NA>
## 35:     0013 2011-02-04                 1                0       2011-02-04            <NA>
## 36:     0013 2011-02-05                 1                0       2011-02-05            <NA>
## 37:     0014 2011-02-06                 0                1       2011-02-06      2011-02-06
## 38:     0014 2011-02-07                 0                0       2011-02-06      2011-02-06
## 39:     0014 2011-02-08                 0                0       2011-02-06      2011-02-06
## 40:     0014 2011-02-09                 0                1       2011-02-06      2011-02-09
## 41:     0014 2011-02-10                 1                0       2011-02-10      2011-02-09
## 42:     0015 2011-02-11                 0                0       2011-02-11            <NA>
## 43:     0015 2011-02-12                 0                0       2011-02-11            <NA>
## 44:     0015 2011-02-13                 0                0       2011-02-11            <NA>
## 45:     0015 2011-02-14                 0                1       2011-02-11      2011-02-14
## 46:     0016 2011-02-15                 1                0       2011-02-15            <NA>
## 47:     0016 2011-02-16                 0                0       2011-02-15            <NA>
## 48:     0017 2011-02-17                 0                0       2011-02-17            <NA>
## 49:     0018 2011-02-18                 0                0       2011-02-18            <NA>
## 50:     0018 2011-02-19                 0                0       2011-02-18            <NA>
##     NUMDCRED       FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora

Here is a demo on your new test data:

## libs
library(data.table);

## generate test data
Poblacion_Morosa3 <- data.table(
    NUMDCRED=c('0001','0001','0001','0002','0002','0002','0003','0003','0003'),
    FDES=c('2012-01-01','2012-03-01','2012-04-01','2011-01-01','2011-02-01','2011-03-01','2012-05-01','2012-06-01','2012-07-01'),
    Flag_Entrada_Mora=c(0,1,0,0,0,0,0,0,0),
    Flag_Salida_Mora=c(0,0,0,0,0,0,0,1,0)
);
Poblacion_Morosa3[,FDES:=as.IDate(FDES)]; ## require correct type for FDES

## solution
system.time({
    findLastIndex <- function(iall,imark) c(0L,imark)[findInterval(iall,imark)+1L];
    n2 <- nrow(Poblacion_Morosa3);
    row.seq <- seq_len(n2);
    num.start <- c(T,Poblacion_Morosa3[,NUMDCRED[-.N]!=NUMDCRED[-1L]]);
    entrada.fdes <- findLastIndex(row.seq,which(num.start | Poblacion_Morosa3[,Flag_Entrada_Mora==1]));
    Ult_Entrada_Mora <- Poblacion_Morosa3[entrada.fdes,FDES];
    salida.na <- findLastIndex(row.seq,which(num.start));
    salida.fdes <- findLastIndex(row.seq,which(Poblacion_Morosa3[,Flag_Salida_Mora==1]));
    Ult_Salida_Mora <- c(as.IDate(NA),Poblacion_Morosa3[,FDES])[ifelse(salida.fdes>=salida.na,salida.fdes+1L,1L)];
});
##   user  system elapsed
##  0.000   0.000   0.003
## show result
cbind(Poblacion_Morosa3,Ult_Entrada_Mora,Ult_Salida_Mora);
##    NUMDCRED       FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora
## 1:     0001 2012-01-01                 0                0       2012-01-01            <NA>
## 2:     0001 2012-03-01                 1                0       2012-03-01            <NA>
## 3:     0001 2012-04-01                 0                0       2012-03-01            <NA>
## 4:     0002 2011-01-01                 0                0       2011-01-01            <NA>
## 5:     0002 2011-02-01                 0                0       2011-01-01            <NA>
## 6:     0002 2011-03-01                 0                0       2011-01-01            <NA>
## 7:     0003 2012-05-01                 0                0       2012-05-01            <NA>
## 8:     0003 2012-06-01                 0                1       2012-05-01      2012-06-01
## 9:     0003 2012-07-01                 0                0       2012-05-01      2012-06-01