Filling in values without a loop

Time: 2015-09-07 12:12:35

Tags: r loops merge dataframe

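I have a large data frame x that contains stock prices on specific dates. I want to merge this data set with a date variable and fill in the last known observation of x until the next specific date, so that I end up with data frame z. The example below shows a single stock.

I am using a loop, but it is very slow because I have five to ten years of daily data and thousands of stocks.

Is there another way? In Matlab, the same code runs much faster.

It is important that I can also use alternative conditions instead of the simple is.na(z[t,2]) condition.

Here is the example: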

> x=data.frame(c("2015-05-31","2015-06-30","2015-07-31"),c(100,200,150))
> colnames(x)=c("Date","AAPL")
> x[,1]=as.Date(x[,1],origin="1970-01-01")
> 
> x
        Date AAPL
1 2015-05-31  100
2 2015-06-30  200
3 2015-07-31  150
> 
> date=data.frame(c("2015-05-31","2015-06-01","2015-06-02","2015-06-03","2015-06-04","2015-06-05","2015-06-06","2015-06-07","2015-06-08","2015-06-09","2015-06-10","2015-06-11","2015-06-12","2015-06-13","2015-06-14","2015-06-15","2015-06-16","2015-06-17","2015-06-18","2015-06-19","2015-06-20","2015-06-21","2015-06-22","2015-06-23","2015-06-24","2015-06-25","2015-06-26","2015-06-27","2015-06-28","2015-06-29","2015-06-30","2015-07-01","2015-07-02","2015-07-03","2015-07-04","2015-07-05","2015-07-06","2015-07-07","2015-07-08","2015-07-09","2015-07-10","2015-07-11","2015-07-12","2015-07-13","2015-07-14","2015-07-15","2015-07-16","2015-07-17","2015-07-18","2015-07-19","2015-07-20","2015-07-21","2015-07-22","2015-07-23","2015-07-24","2015-07-25","2015-07-26","2015-07-27","2015-07-28","2015-07-29","2015-07-30","2015-07-31"))
> colnames(date)=c("Date")
> date[,1]=as.Date(date[,1],origin="1970-01-01")
> 
> date
         Date
1  2015-05-31
2  2015-06-01
3  2015-06-02
29 ...
30 2015-06-29
31 2015-06-30
32 2015-07-01
33 2015-07-02

> 
> z=merge(x=x, y=date, by.x="Date", by.y="Date",all.y=TRUE)
> 
> 
> # Converting z to a data matrix speeds up the loop
> z=data.matrix(z)
> 
> for (t in 1:nrow(z)) {
+   # carry the previous row's value forward when the current one is missing
+   if (t>1 && is.na(z[t,2])){
+     z[t,2]=z[t-1,2]
+   }
+ }
> 
> z=as.data.frame(z)
> z[,1]=as.Date(z[,1],origin="1970-01-01")
> 
> z
         Date AAPL
1  2015-05-31  100
2  2015-06-01  100
3  2015-06-02  100
29 ...
30 2015-06-29  100
31 2015-06-30  200
32 2015-07-01  200
33 2015-07-02  200

4 Answers:

Answer 0 (score: 3)

Using the dplyr and zoo packages worked for me:

library(dplyr)
library(zoo)

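my_new_df <-
  right_join(x, date, by = "Date") %>% 
  arrange(Date) %>%                          # rows must be in date order before filling
  mutate(y = na.locf(AAPL, na.rm = FALSE))   # keep leading NAs so the length matches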

head(my_new_df)

        Date AAPL   y
1 2015-05-31  100 100
2 2015-06-01   NA 100
3 2015-06-02   NA 100
4 2015-06-03   NA 100
5 2015-06-04   NA 100
6 2015-06-05   NA 100

tail(my_new_df)

         Date AAPL   y
57 2015-07-26   NA 200
58 2015-07-27   NA 200
59 2015-07-28   NA 200
60 2015-07-29   NA 200
61 2015-07-30   NA 200
62 2015-07-31  150 150

Answer 1 (score: 2)

You could try a concise (and fast) data.table solution:

library(data.table)
setkey(setDT(x),Date)[setDT(date), roll=TRUE]
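
To illustrate roughly what a rolling join computes, here is a minimal base R sketch using findInterval. It is not how data.table implements it; the names prices, days and idx are made up for illustration, and it assumes the price dates are sorted in increasing order.

```r
# Toy data matching the question: three price observations, daily calendar
prices <- data.frame(
  Date = as.Date(c("2015-05-31", "2015-06-30", "2015-07-31")),
  AAPL = c(100, 200, 150)
)
days <- data.frame(
  Date = seq(as.Date("2015-05-31"), as.Date("2015-07-31"), by = "day")
)

# For each calendar day, find the position of the last price date <= that day
idx <- findInterval(as.numeric(days$Date), as.numeric(prices$Date))
idx[idx == 0] <- NA            # days before the first observation stay NA

# Look up the price at that position; no merge and no loop needed
days$AAPL <- prices$AAPL[idx]
```

Because the lookup is a single vectorised index operation, the same idea can be applied column by column when there are many stocks.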

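Answer 2 (score: 2)

We can do this with base R. We get a logical index of the non-NA elements of 'AAPL' ('i1'), take the cumsum of 'i1' to convert it to a numeric index, and use that index to replace the NA elements with the preceding non-NA elements.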

i1 <- !is.na(z$AAPL)
z$AAPL <- z$AAPL[i1][cumsum(i1)]
head(z)
#        Date AAPL
#1 2015-05-31  100
#2 2015-06-01  100
#3 2015-06-02  100
#4 2015-06-03  100
#5 2015-06-04  100
#6 2015-06-05  100
tail(z)
#         Date AAPL
#57 2015-07-26  200
#58 2015-07-27  200
#59 2015-07-28  200
#60 2015-07-29  200
#61 2015-07-30  200
#62 2015-07-31  150
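
The same cumsum idea can be wrapped in a small function that also accepts an alternative condition, as the question asks. This is only a sketch: fill_forward and its keep argument are hypothetical names, not part of any package. Prepending NA handles the case where the series starts with values that should be filled.

```r
# Hypothetical helper: carry forward the last value for which `keep` is TRUE.
# `keep` defaults to "not NA" but can be any logical vector of the same length.
fill_forward <- function(v, keep = !is.na(v)) {
  idx <- cumsum(keep)          # 0 before the first kept value, then 1, 2, ...
  c(NA, v[keep])[idx + 1L]     # position 1 is NA, so leading gaps stay NA
}

# Default condition: fill NAs with the last known value
prices <- c(NA, 100, NA, NA, 200, NA, 150)
filled <- fill_forward(prices)

# Alternative condition: also treat non-positive prices as missing
prices2 <- c(100, 0, NA, 200, -1)
filled2 <- fill_forward(prices2, keep = !is.na(prices2) & prices2 > 0)
```

Here filled keeps the leading NA and carries 100 and 200 forward, while filled2 replaces the 0, the NA and the -1 with the last positive price before them.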

Answer 3 (score: 0)

If you decide to use a time-series class such as zoo, this can be done easily with na.locf from the zoo package. Here is some info