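I'm running a for loop to fill a vector, but the loop takes 2 hours. I don't know whether that's because I'm doing something inefficient or simply because loops in R are slow. I have to use a loop for this part because each step needs the previous value, so I can't vectorize the operation.
I'm using the data.table package. My laptop has 8 GB of RAM and an Intel Core i5 pro at 2.3 GHz, running 64-bit R 3.2.3.
The table has the following structure (sorted ascending by NUMDCRED and FDES):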
NUMDCRED FDES Flag_Entrada_Mora Flag_Salida_Mora
0001 "2012-01-01" 0 0
0001 "2012-03-01" 1 0
0001 "2012-04-01" 0 0
0002 "2011-01-01" 0 0
0002 "2011-02-01" 0 0
0002 "2011-03-01" 0 0
0003 "2012-05-01" 0 0
0003 "2012-06-01" 0 1
0003 "2012-07-01" 0 0
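The code uses the variables FDES, Flag_Entrada_Mora and Flag_Salida_Mora to create two new variables, Ult_Entrada_Mora and Ult_Salida_Mora. Ult_Entrada_Mora records the last date on which the NUMDCRED entered mora, and Ult_Salida_Mora records the last date on which it exited mora. On the first row of each NUMDCRED (the first date on which that NUMDCRED appears), Ult_Entrada_Mora must take the FDES value, and that date must be repeated until it is updated, which happens every time Flag_Entrada_Mora is 1. Ult_Salida_Mora must be NA on the first row of each NUMDCRED and stay NA until Flag_Salida_Mora updates it; that value is then repeated until the next update, and so on.
In my code, First_NumdCred_index gives the rows where a new NUMDCRED starts, so I use %in% to check whether i is one of those indices. aux_entrada and aux_salida are updated only when one of the events described above occurs.
The output table for the example above is: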
NUMDCRED FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora
0001 "2012-01-01" 0 0 "2012-01-01" NA
0001 "2012-03-01" 1 0 "2012-03-01" NA
0001 "2012-04-01" 0 0 "2012-03-01" NA
0002 "2011-01-01" 0 0 "2011-01-01" NA
0002 "2011-02-01" 0 0 "2011-01-01" NA
0002 "2011-03-01" 0 0 "2011-01-01" NA
0003 "2012-05-01" 0 0 "2012-05-01" NA
0003 "2012-06-01" 0 1 "2012-05-01" "2012-06-01"
0003 "2012-07-01" 0 0 "2012-05-01" "2012-06-01"
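Below is the code I use to run the loop (n2 = 648,385). First_NumdCred_index is a vector holding a set of row indices of the table; it has length 148,982 and class numeric. FDES is an IDate, and Flag_Entrada_Mora and Flag_Salida_Mora are numeric.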
n2 <- length(Poblacion_Morosa3$NUMDCRED)
Ult_Entrada_Mora <- seq(as.IDate("2020-01-01"),by = "month",length.out = n2)
#vector(mode = "character",length=n2)
Ult_Salida_Mora <- seq(as.IDate("2020-01-01"),by = "month",length.out = n2)
aux_entrada <- as.IDate("2005-01-01")
aux_salida <- as.IDate("2005-01-01")
for(i in 1:n2){
    if(i %in% First_NumdCred_index){
        aux_entrada <- Poblacion_Morosa3[i,FDES]
        aux_salida <- NA
    } else if(Poblacion_Morosa3[i,Flag_Entrada_Mora] == 1){
        aux_entrada <- Poblacion_Morosa3[i,FDES]
    } else if(Poblacion_Morosa3[i,Flag_Salida_Mora] == 1){
        aux_salida <- Poblacion_Morosa3[i,FDES]
    }
    Ult_Entrada_Mora[i] <- aux_entrada
    Ult_Salida_Mora[i] <- aux_salida
}
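I'd like to know whether it's normal for this to take 2 hours to run, or whether I'm doing something inefficiently.
Answer 0 (score: 2)
Here's a stab at something, though I'm not sure it's what you're trying to do: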
library(data.table)
set.seed(123)
ex <- data.table(FDES=sample(seq(as.IDate("2001-01-01"),by="month",length=100),
                             1000,replace=TRUE),
                 flag_entrance=sample(c(0,1),1000,replace=TRUE),
                 flag_exit=sample(c(0,1),1000,replace=TRUE))
First_NumCred_index <- sample(1:nrow(ex),250,replace=FALSE)
> ex
FDES flag_entrance flag_exit
1: 2003-05-01 0 0
2: 2007-07-01 1 0
3: 2004-05-01 0 0
4: 2008-05-01 1 1
5: 2008-11-01 1 0
---
996: 2007-11-01 1 0
997: 2006-05-01 0 1
998: 2004-04-01 1 0
999: 2006-11-01 0 0
1000: 2001-11-01 0 1
Now we can handle this in a couple of passes. You could probably make it a bit faster still, but this seems fast enough...
ex[,`:=`(date.seq.1=as.IDate(NA_integer_,origin="1970-01-01"),
date.seq.2=as.IDate(NA_integer_,origin="1970-01-01"))]
ex[First_NumCred_index,date.seq.1:=FDES]
ex[flag_entrance==1,date.seq.1:=FDES]
ex[flag_exit==1,date.seq.2:=FDES]
> ex
FDES flag_entrance flag_exit date.seq.1 date.seq.2
1: 2003-05-01 0 0 <NA> <NA>
2: 2007-07-01 1 0 2007-07-01 <NA>
3: 2004-05-01 0 0 2004-05-01 <NA>
4: 2008-05-01 1 1 2008-05-01 2008-05-01
5: 2008-11-01 1 0 2008-11-01 <NA>
---
996: 2007-11-01 1 0 2007-11-01 <NA>
997: 2006-05-01 0 1 <NA> 2006-05-01
998: 2004-04-01 1 0 2004-04-01 <NA>
999: 2006-11-01 0 0 <NA> <NA>
1000: 2001-11-01 0 1 <NA> 2001-11-01
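So you keep your date sequences with the NAs, which you (apparently?) want, and you can get them back as vectors with ex[,date.seq.1] and so on.
I suspect I haven't fully understood your problem. In particular, you say you sometimes need to refer to the previous row's value. If so, you can combine the suggestion above with a call to shift. For example, if you need to "take the previous row's value when a condition holds, otherwise use the current row's value," you could do something like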
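## fifelse() (data.table >= 1.12.4) keeps the IDate class; base ifelse() would return plain integers
ex[,date.seq.3:=fifelse( condition, shift(FDES), FDES)]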
Best.
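As a concrete illustration of the shift() suggestion, here is a minimal sketch on a made-up four-row table. The column name flag and the condition flag == 1 are placeholders chosen for the example, not taken from the question. fifelse() is used in place of base ifelse() so that the result stays an IDate; this assumes data.table 1.12.4 or later.

```r
library(data.table)

## Made-up example data: four monthly dates and a 0/1 flag (hypothetical).
ex <- data.table(
  FDES = as.IDate(c("2001-01-01", "2001-02-01", "2001-03-01", "2001-04-01")),
  flag = c(0, 1, 0, 1)
)

## shift() lags the column by one row; the first row has no predecessor and gets NA.
ex[, prev_FDES := shift(FDES)]

## Where flag == 1 take the previous row's date, otherwise the current row's date.
ex[, date.seq.3 := fifelse(flag == 1, shift(FDES), FDES)]

print(ex)
```

Note that shift() only looks one row back. Carrying a value forward over an arbitrary number of rows needs a different approach.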
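Edit, to expand on my comment. If all you want is to "keep repeating the last date until you see a 1, then switch to the following date," then you could try something like this: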
> ex[,.(FDES,flag_entrance,FDES[cumsum(rle(flag_entrance)$values)])]
FDES flag_entrance V3
1: 2003-05-01 0 2003-05-01
2: 2007-07-01 1 2003-05-01
3: 2004-05-01 0 2007-07-01
4: 2008-05-01 1 2007-07-01
5: 2008-11-01 1 2004-05-01
---
If you copy this vector into the data.table rather than just grabbing the vector, be careful about recycling.
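If the carry-forward has to restart at each NUMDCRED, one way to express it is per group with cumsum(). The following is a sketch on the question's nine-row example, not a tested replacement for the full table. The table name DT and the local names mark and last are made up here. It also treats an exit flag on the first row of a group as a valid exit, which differs from the if / else-if order in the original loop.

```r
library(data.table)

## The question's example table (values copied from the post).
DT <- data.table(
  NUMDCRED = rep(c("0001", "0002", "0003"), each = 3L),
  FDES = as.IDate(c("2012-01-01", "2012-03-01", "2012-04-01",
                    "2011-01-01", "2011-02-01", "2011-03-01",
                    "2012-05-01", "2012-06-01", "2012-07-01")),
  Flag_Entrada_Mora = c(0, 1, 0, 0, 0, 0, 0, 0, 0),
  Flag_Salida_Mora  = c(0, 0, 0, 0, 0, 0, 0, 1, 0)
)

## Entry date: the first row of each group always counts as a marker,
## so cumsum(mark) is >= 1 everywhere and indexes the latest marker row.
DT[, Ult_Entrada_Mora := {
  mark <- Flag_Entrada_Mora == 1
  mark[1L] <- TRUE
  FDES[which(mark)[cumsum(mark)]]
}, by = NUMDCRED]

## Exit date: cumsum(mark) is 0 before the first exit flag in the group,
## so position 1 of the lookup vector is NA and yields an NA date.
DT[, Ult_Salida_Mora := {
  mark <- Flag_Salida_Mora == 1
  last <- c(NA_integer_, which(mark))[cumsum(mark) + 1L]
  FDES[last]
}, by = NUMDCRED]

print(DT)
```

On this example the two new columns should match the expected output listed in the question.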
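Answer 1 (score: 1)
I suspect the %in% operation inside the loop is taking most of the time. You can remove it by precomputing its result once, outside the loop, like this: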
isFirstNumdCred <- (1:n2) %in% First_NumdCred_index
for(i in 1:n2){
    if(isFirstNumdCred[i]){
        aux_entrada <- Poblacion_Morosa3[i,FDES]
        aux_salida <- NA
    } else if(Poblacion_Morosa3[i,Flag_Entrada_Mora] == 1){
        aux_entrada <- Poblacion_Morosa3[i,FDES]
    } else if(Poblacion_Morosa3[i,Flag_Salida_Mora] == 1){
        aux_salida <- Poblacion_Morosa3[i,FDES]
    }
    Ult_Entrada_Mora[i] <- aux_entrada
    Ult_Salida_Mora[i] <- aux_salida
}
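Beyond precomputing %in%, each Poblacion_Morosa3[i,FDES] call inside the loop goes through data.table's subsetting machinery, which carries noticeable per-call overhead. A possible further step, sketched here on the question's nine-row example with First_NumdCred_index assumed to be c(1, 4, 7), is to extract the columns into plain vectors once and loop over those. The helper names fdes, entrada, salida, ult_entrada and ult_salida are made up for this sketch, and no timing is claimed for the full table.

```r
library(data.table)

## The question's example table (values copied from the post).
Poblacion_Morosa3 <- data.table(
  NUMDCRED = rep(c("0001", "0002", "0003"), each = 3L),
  FDES = as.IDate(c("2012-01-01", "2012-03-01", "2012-04-01",
                    "2011-01-01", "2011-02-01", "2011-03-01",
                    "2012-05-01", "2012-06-01", "2012-07-01")),
  Flag_Entrada_Mora = c(0, 1, 0, 0, 0, 0, 0, 0, 0),
  Flag_Salida_Mora  = c(0, 0, 0, 0, 0, 0, 0, 1, 0)
)
First_NumdCred_index <- c(1, 4, 7)  # assumed: first row of each NUMDCRED

n2 <- nrow(Poblacion_Morosa3)

## Pull the columns out once; the loop then only touches plain vectors.
fdes    <- as.integer(Poblacion_Morosa3$FDES)  # IDate is stored as integer days
entrada <- Poblacion_Morosa3$Flag_Entrada_Mora == 1
salida  <- Poblacion_Morosa3$Flag_Salida_Mora == 1
isFirstNumdCred <- seq_len(n2) %in% First_NumdCred_index

ult_entrada <- integer(n2)
ult_salida  <- integer(n2)
aux_entrada <- NA_integer_
aux_salida  <- NA_integer_
for (i in seq_len(n2)) {
  if (isFirstNumdCred[i]) {
    aux_entrada <- fdes[i]
    aux_salida  <- NA_integer_
  } else if (entrada[i]) {
    aux_entrada <- fdes[i]
  } else if (salida[i]) {
    aux_salida <- fdes[i]
  }
  ult_entrada[i] <- aux_entrada
  ult_salida[i]  <- aux_salida
}

## Restore the date class once, after the loop.
Ult_Entrada_Mora <- as.IDate(ult_entrada, origin = "1970-01-01")
Ult_Salida_Mora  <- as.IDate(ult_salida,  origin = "1970-01-01")
```

The branch order is kept identical to the original loop, so the results should agree with it row for row.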
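Answer 2 (score: 1)
In my view, findInterval() is the most suitable function for this problem. Your intermediate variables basically hold their previous value, except at known marker rows in the row sequence, where they change to a known value that is either fixed (NA) or looked up in the input frame (the FDES column). We can use findInterval() to find the nearest preceding marker according to the required logic, and then use the winning marker's index to index a vector of target values.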
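To make the marker idea concrete, here is a small sketch of what findInterval() returns for a toy row sequence. The marker positions are invented for illustration, and the names iall and imark simply mirror the helper's argument names.

```r
## Rows 1..9; suppose the "marker" rows (where the value changes) are 1, 4 and 8.
iall  <- 1:9
imark <- c(1L, 4L, 8L)

## For each row, findInterval() counts how many markers are <= that row.
k <- findInterval(iall, imark)
print(k)

## Indexing the markers by that count gives the most recent marker row.
print(imark[k])

## If a row can precede every marker, the count is 0 there; prepending a
## sentinel 0L and adding 1 to the count keeps the indexing valid.
k2 <- findInterval(1:3, 2L)
print(c(0L, 2L)[k2 + 1L])
```

A result of 0 from the last expression means "no marker seen yet", which is the case that gets mapped to NA for Ult_Salida_Mora.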
## libs
library(data.table);
## generate test data
set.seed(4L);
n2 <- 648385L;
Poblacion_Morosa3 <- data.table(
    NUMDCRED=sprintf('%04d',cumsum(c(T,sample(c(rep(F,3L),T),n2-1L,replace=T)))), ## avg 4 rows per num
    FDES=seq(as.IDate('2011-01-01'),by=1,len=n2),
    Flag_Entrada_Mora=sample(c(rep(0L,5L),1L),n2,replace=T), ## avg 6 rows per flag
    Flag_Salida_Mora=sample(c(rep(0L,5L),1L),n2,replace=T) ## ditto
);
## solution
system.time({
    findLastIndex <- function(iall,imark) c(0L,imark)[findInterval(iall,imark)+1L];
    n2 <- nrow(Poblacion_Morosa3);
    row.seq <- seq_len(n2);
    num.start <- c(T,Poblacion_Morosa3[,NUMDCRED[-.N]!=NUMDCRED[-1L]]);
    entrada.fdes <- findLastIndex(row.seq,which(num.start | Poblacion_Morosa3[,Flag_Entrada_Mora==1]));
    Ult_Entrada_Mora <- Poblacion_Morosa3[entrada.fdes,FDES];
    salida.na <- findLastIndex(row.seq,which(num.start));
    salida.fdes <- findLastIndex(row.seq,which(Poblacion_Morosa3[,Flag_Salida_Mora==1]));
    Ult_Salida_Mora <- c(as.IDate(NA),Poblacion_Morosa3[,FDES])[ifelse(salida.fdes>=salida.na,salida.fdes+1L,1L)];
});
## user system elapsed
## 0.328 0.047 0.374
## show result
head(cbind(Poblacion_Morosa3,Ult_Entrada_Mora,Ult_Salida_Mora),50L);
## NUMDCRED FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora
## 1: 0001 2011-01-01 0 0 2011-01-01 <NA>
## 2: 0001 2011-01-02 0 0 2011-01-01 <NA>
## 3: 0001 2011-01-03 1 0 2011-01-03 <NA>
## 4: 0001 2011-01-04 0 0 2011-01-03 <NA>
## 5: 0001 2011-01-05 0 0 2011-01-03 <NA>
## 6: 0002 2011-01-06 0 0 2011-01-06 <NA>
## 7: 0002 2011-01-07 0 0 2011-01-06 <NA>
## 8: 0002 2011-01-08 0 0 2011-01-06 <NA>
## 9: 0003 2011-01-09 1 0 2011-01-09 <NA>
## 10: 0004 2011-01-10 1 0 2011-01-10 <NA>
## 11: 0004 2011-01-11 0 0 2011-01-10 <NA>
## 12: 0005 2011-01-12 0 0 2011-01-12 <NA>
## 13: 0005 2011-01-13 1 0 2011-01-13 <NA>
## 14: 0005 2011-01-14 0 0 2011-01-13 <NA>
## 15: 0006 2011-01-15 0 1 2011-01-15 2011-01-15
## 16: 0006 2011-01-16 0 0 2011-01-15 2011-01-15
## 17: 0006 2011-01-17 0 1 2011-01-15 2011-01-17
## 18: 0007 2011-01-18 1 0 2011-01-18 <NA>
## 19: 0007 2011-01-19 0 0 2011-01-18 <NA>
## 20: 0008 2011-01-20 0 0 2011-01-20 <NA>
## 21: 0009 2011-01-21 0 0 2011-01-21 <NA>
## 22: 0009 2011-01-22 1 0 2011-01-22 <NA>
## 23: 0010 2011-01-23 0 1 2011-01-23 2011-01-23
## 24: 0010 2011-01-24 0 1 2011-01-23 2011-01-24
## 25: 0010 2011-01-25 1 0 2011-01-25 2011-01-24
## 26: 0010 2011-01-26 0 0 2011-01-25 2011-01-24
## 27: 0011 2011-01-27 0 0 2011-01-27 <NA>
## 28: 0011 2011-01-28 0 0 2011-01-27 <NA>
## 29: 0012 2011-01-29 0 1 2011-01-29 2011-01-29
## 30: 0012 2011-01-30 0 0 2011-01-29 2011-01-29
## 31: 0012 2011-01-31 1 0 2011-01-31 2011-01-29
## 32: 0012 2011-02-01 0 0 2011-01-31 2011-01-29
## 33: 0012 2011-02-02 0 0 2011-01-31 2011-01-29
## 34: 0013 2011-02-03 0 0 2011-02-03 <NA>
## 35: 0013 2011-02-04 1 0 2011-02-04 <NA>
## 36: 0013 2011-02-05 1 0 2011-02-05 <NA>
## 37: 0014 2011-02-06 0 1 2011-02-06 2011-02-06
## 38: 0014 2011-02-07 0 0 2011-02-06 2011-02-06
## 39: 0014 2011-02-08 0 0 2011-02-06 2011-02-06
## 40: 0014 2011-02-09 0 1 2011-02-06 2011-02-09
## 41: 0014 2011-02-10 1 0 2011-02-10 2011-02-09
## 42: 0015 2011-02-11 0 0 2011-02-11 <NA>
## 43: 0015 2011-02-12 0 0 2011-02-11 <NA>
## 44: 0015 2011-02-13 0 0 2011-02-11 <NA>
## 45: 0015 2011-02-14 0 1 2011-02-11 2011-02-14
## 46: 0016 2011-02-15 1 0 2011-02-15 <NA>
## 47: 0016 2011-02-16 0 0 2011-02-15 <NA>
## 48: 0017 2011-02-17 0 0 2011-02-17 <NA>
## 49: 0018 2011-02-18 0 0 2011-02-18 <NA>
## 50: 0018 2011-02-19 0 0 2011-02-18 <NA>
## NUMDCRED FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora
Here is a demo on your new test data:
## libs
library(data.table);
## generate test data
Poblacion_Morosa3 <- data.table(
    NUMDCRED=c('0001','0001','0001','0002','0002','0002','0003','0003','0003'),
    FDES=c('2012-01-01','2012-03-01','2012-04-01','2011-01-01','2011-02-01','2011-03-01','2012-05-01','2012-06-01','2012-07-01'),
    Flag_Entrada_Mora=c(0,1,0,0,0,0,0,0,0),
    Flag_Salida_Mora=c(0,0,0,0,0,0,0,1,0)
);
Poblacion_Morosa3[,FDES:=as.IDate(FDES)]; ## require correct type for FDES
## solution
system.time({
    findLastIndex <- function(iall,imark) c(0L,imark)[findInterval(iall,imark)+1L];
    n2 <- nrow(Poblacion_Morosa3);
    row.seq <- seq_len(n2);
    num.start <- c(T,Poblacion_Morosa3[,NUMDCRED[-.N]!=NUMDCRED[-1L]]);
    entrada.fdes <- findLastIndex(row.seq,which(num.start | Poblacion_Morosa3[,Flag_Entrada_Mora==1]));
    Ult_Entrada_Mora <- Poblacion_Morosa3[entrada.fdes,FDES];
    salida.na <- findLastIndex(row.seq,which(num.start));
    salida.fdes <- findLastIndex(row.seq,which(Poblacion_Morosa3[,Flag_Salida_Mora==1]));
    Ult_Salida_Mora <- c(as.IDate(NA),Poblacion_Morosa3[,FDES])[ifelse(salida.fdes>=salida.na,salida.fdes+1L,1L)];
});
## user system elapsed
## 0.000 0.000 0.003
## show result
cbind(Poblacion_Morosa3,Ult_Entrada_Mora,Ult_Salida_Mora);
## NUMDCRED FDES Flag_Entrada_Mora Flag_Salida_Mora Ult_Entrada_Mora Ult_Salida_Mora
## 1: 0001 2012-01-01 0 0 2012-01-01 <NA>
## 2: 0001 2012-03-01 1 0 2012-03-01 <NA>
## 3: 0001 2012-04-01 0 0 2012-03-01 <NA>
## 4: 0002 2011-01-01 0 0 2011-01-01 <NA>
## 5: 0002 2011-02-01 0 0 2011-01-01 <NA>
## 6: 0002 2011-03-01 0 0 2011-01-01 <NA>
## 7: 0003 2012-05-01 0 0 2012-05-01 <NA>
## 8: 0003 2012-06-01 0 1 2012-05-01 2012-06-01
## 9: 0003 2012-07-01 0 0 2012-05-01 2012-06-01