Set values to NA in a data frame if Date falls outside a given interval

Asked: 2017-04-28 10:54:50

Tags: r dataframe apply lapply

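I have two data frames, df1 and df2.

df1 contains the values of different products (X1, X2, etc.) at different times. df2 contains the true start and end dates for some of those products. I would like to replace every value that falls outside the date interval given in df2 with NA, as shown in the final table df3.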

Creating df1 and df2:

df1=data.frame(matrix(NA,10,6))
df1[,1]=(c(seq(as.Date("2012-01-01"),as.Date("2012-10-01"),by="1 month")))
df1[,2]=c(1:10); df1[,3]=c(12:21); df1[,4]=c(0.5:10); df1[,5]=c(5:14); df1[,6]=c(10:19)
colnames(df1)=c("Date","X1","X2","X3","X4","X5")
df2=data.frame(matrix(data=c("X1","X2","X4","2012-02-01","2012-04-01","2012-06-01","2012-09-01","2012-06-01","2012-10-01"),3,3))
colnames(df2)=c("Name","Start","End")

Output:

> df1
         Date X1 X2  X3 X4 X5
1  2012-01-01  1 12 0.5  5 10
2  2012-02-01  2 13 1.5  6 11
3  2012-03-01  3 14 2.5  7 12
4  2012-04-01  4 15 3.5  8 13
5  2012-05-01  5 16 4.5  9 14
6  2012-06-01  6 17 5.5 10 15
7  2012-07-01  7 18 6.5 11 16
8  2012-08-01  8 19 7.5 12 17
9  2012-09-01  9 20 8.5 13 18
10 2012-10-01 10 21 9.5 14 19
> df2
  Name      Start        End
1   X1 2012-02-01 2012-09-01
2   X2 2012-04-01 2012-06-01
3   X4 2012-06-01 2012-10-01

The final output should look like this:

 df3
       Date  X1  X2  X3 X4 X5
1  2012-01-01 NA NA 0.5 NA 10
2  2012-02-01  2 NA 1.5 NA 11
3  2012-03-01  3 NA 2.5 NA 12
4  2012-04-01  4 15 3.5 NA 13
5  2012-05-01  5 16 4.5 NA 14
6  2012-06-01  6 17 5.5 10 15
7  2012-07-01  7 NA 6.5 11 16
8  2012-08-01  8 NA 7.5 12 17
9  2012-09-01  9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19

3 answers:

Answer 0 (score: 1)

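I am sure there are more elegant ways, but you could create a matrix of indices that records where your condition holds, then turn it into a mask that is 1 if an element is within its product's interval and NA if it is not. Assuming you are dealing with numeric values, you can then multiply your data frame by that mask: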
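The masking trick relies on how R evaluates powers of NA: anything raised to the power 0 is 1, while NA raised to the power 1 stays NA. A minimal sketch on a plain vector, where the names `inside`, `mask`, `x` and `masked` are made up for illustration:

```r
# TRUE marks elements that should be kept
inside <- c(TRUE, FALSE, TRUE)

# !inside is FALSE TRUE FALSE, i.e. exponents 0 1 0,
# so NA^0 = 1 where we keep a value and NA^1 = NA where we drop it
mask <- NA^(!inside)

x <- c(10, 20, 30)
masked <- x * mask   # 10 NA 30
```

Note that `NA^inside` without the negation would do the opposite and blank out the elements that are inside the interval.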

Example:

library(dplyr)
## Convert your dates to Date-objects:
df2 <- df2 %>% dplyr::mutate(Start = as.Date(Start), End = as.Date(End))

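## Create a matrix of indices (TRUE = Date lies within the product's interval).
## Products that have no row in df2 are left untouched (all TRUE):
indMx <- lapply(names(df1)[-1], function(product){
            i <- match(product, df2$Name)
            if (is.na(i)) return(rep(TRUE, nrow(df1)))
            (df1$Date >= df2$Start[i]) & (df1$Date <= df2$End[i])
        }) %>% do.call('cbind',.)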

## Multiply with NA^(!indMx), which gives you NA in place of FALSE and 
## 1 in place of TRUE (NA^0 is 1, NA^1 is NA):
df1[,-1] <- df1[,-1]*NA^(!indMx)

df1
#          Date X1 X2  X3
# 1  2012-01-01 NA NA  NA
# 2  2012-02-01  2 NA  NA
# 3  2012-03-01  3 NA  NA
# 4  2012-04-01  4 15  NA
# 5  2012-05-01  5 16  NA
# 6  2012-06-01  6 17 5.5
# 7  2012-07-01  7 NA 6.5
# 8  2012-08-01  8 NA 7.5
# 9  2012-09-01  9 NA 8.5
# 10 2012-10-01 10 21  NA
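For comparison, the same result can probably be reached in base R without an index matrix, by looping over the rows of df2 and blanking the rows of the matching column that fall outside the interval. This is only a sketch: it rebuilds the question's data with `stringsAsFactors = FALSE` so that `Name` is a character vector, and the helper name `outside` is made up for illustration.

```r
# Rebuild the question's data (Name kept as character, dates as Date)
df1 <- data.frame(
  Date = seq(as.Date("2012-01-01"), as.Date("2012-10-01"), by = "1 month"),
  X1 = 1:10, X2 = 12:21, X3 = seq(0.5, 9.5, by = 1), X4 = 5:14, X5 = 10:19
)
df2 <- data.frame(
  Name  = c("X1", "X2", "X4"),
  Start = as.Date(c("2012-02-01", "2012-04-01", "2012-06-01")),
  End   = as.Date(c("2012-09-01", "2012-06-01", "2012-10-01")),
  stringsAsFactors = FALSE
)

df3 <- df1
for (i in seq_len(nrow(df2))) {
  # rows whose Date lies before Start or after End for this product
  outside <- df1$Date < df2$Start[i] | df1$Date > df2$End[i]
  df3[outside, df2$Name[i]] <- NA
}
```

Columns that do not appear in df2 (X3 and X5 here) are never touched, so they keep their original values.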

Answer 1 (score: 1)

Here is a solution using data.table. There may be a more elegant approach using a non-equi join.

for(i in seq_len(nrow(df2))) df1[!(Date %between% df2[i,.(Start, End)]), df2[i, Name] := NA]

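Here, you loop through each row of df2, subset df1 to the rows whose Date falls outside the start and end dates in the current row of df2, and then assign NA to the variable named in that row of df2.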

This returns

df1
          Date X1 X2  X3
 1: 2012-01-01 NA NA  NA
 2: 2012-02-01  2 NA  NA
 3: 2012-03-01  3 NA  NA
 4: 2012-04-01  4 15  NA
 5: 2012-05-01  5 16  NA
 6: 2012-06-01  6 17 5.5
 7: 2012-07-01  7 NA 6.5
 8: 2012-08-01  8 NA 7.5
 9: 2012-09-01  9 NA 8.5
10: 2012-10-01 NA NA 9.5
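The non-equi join mentioned above might look roughly like the following. This is a sketch, not the answerer's code: it melts df1 to long form, flags rows that match an interval in df2 through an update join, and casts back to wide form. The column name `inside` and the intermediate table `long` are made up for illustration, and the value columns are built as numeric so that `melt` does not have to reconcile mixed types.

```r
library(data.table)

df1 <- data.table(
  Date = seq(as.Date("2012-01-01"), as.Date("2012-10-01"), by = "1 month"),
  X1 = as.numeric(1:10), X2 = as.numeric(12:21), X3 = seq(0.5, 9.5, by = 1)
)
df2 <- data.table(
  Name  = c("X1", "X2", "X3"),
  Start = as.Date(c("2012-02-01", "2012-04-01", "2012-06-01")),
  End   = as.Date(c("2012-09-01", "2012-06-01", "2012-10-01"))
)

# long form: one row per Date and product
long <- melt(df1, id.vars = "Date", variable.name = "Name",
             value.name = "value", variable.factor = FALSE)

# products without an interval in df2 count as always inside
long[, inside := !(Name %in% df2$Name)]

# non-equi update join: flag rows whose Date lies within [Start, End]
long[df2, on = .(Name, Date >= Start, Date <= End), inside := TRUE]

# blank everything that was not flagged, then go back to wide form
long[inside == FALSE, value := NA_real_]
df3 <- dcast(long, Date ~ Name, value.var = "value")
```

Using an update join rather than returning the join result sidesteps the way data.table renames the join columns in a non-equi join, which is easy to trip over.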

Update

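If the data are constructed as in the updated original post, first run the lines below to convert both data frames to data.tables and to turn the Name variable in df2 into a character vector (it starts out as a factor). The code above will then work on the new data set.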

# convert data.frames to data.tables
setDT(df1)
setDT(df2)

# convert factor to character
df2[, Name := as.character(Name)]

Data

library(data.table)
# read in data
df1 <- fread("Date X1 X2  X3
2012-01-01  1 12 0.5
2012-02-01  2 13 1.5
2012-03-01  3 14 2.5
2012-04-01  4 15 3.5
2012-05-01  5 16 4.5
2012-06-01  6 17 5.5
2012-07-01  7 18 6.5
2012-08-01  8 19 7.5
2012-09-01  9 20 8.5
2012-10-01 10 21 9.5")

df2 <- fread("  Name      Start        End
X1 2012-02-01 2012-09-01
X2 2012-04-01 2012-06-01
X3 2012-06-01 2012-10-01")

# convert to date type
df1[, Date := as.Date(Date)]
df2[, c("Start", "End")  := .(as.Date(Start), as.Date(End))]

Answer 2 (score: 1)

Using dplyr and tidyr ...

library(tidyr)
library(dplyr)

df3 <- df1 %>% gather(key=Name,value=value,-Date) %>% #convert to long form
  left_join(df2) %>% #merge in date limits
  mutate(ind=(as.Date(Date)>=as.Date(Start) & as.Date(Date)<=as.Date(End))) %>% #check valid 
  mutate(value=replace(value,!ind,NA)) %>% #replace invalid with NA
  select(Date,Name,value) %>% #remove unnecessary variables
  spread(key=Name,value=value) #convert back to rectangular form

df3
         Date X1 X2  X3 X4 X5
1  2012-01-01 NA NA 0.5 NA 10
2  2012-02-01  2 NA 1.5 NA 11
3  2012-03-01  3 NA 2.5 NA 12
4  2012-04-01  4 15 3.5 NA 13
5  2012-05-01  5 16 4.5 NA 14
6  2012-06-01  6 17 5.5 10 15
7  2012-07-01  7 NA 6.5 11 16
8  2012-08-01  8 NA 7.5 12 17
9  2012-09-01  9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19
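In recent versions of tidyr, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The same pipeline could be sketched with those functions as follows, assuming tidyr 1.0 or later. The data are rebuilt here with numeric value columns and a character `Name` so the example is self-contained, and products with no row in df2 are kept as they are.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  Date = seq(as.Date("2012-01-01"), as.Date("2012-10-01"), by = "1 month"),
  X1 = as.numeric(1:10), X2 = as.numeric(12:21), X3 = seq(0.5, 9.5, by = 1),
  X4 = as.numeric(5:14), X5 = as.numeric(10:19)
)
df2 <- data.frame(
  Name  = c("X1", "X2", "X4"),
  Start = as.Date(c("2012-02-01", "2012-04-01", "2012-06-01")),
  End   = as.Date(c("2012-09-01", "2012-06-01", "2012-10-01")),
  stringsAsFactors = FALSE
)

df3 <- df1 %>%
  pivot_longer(cols = -Date, names_to = "Name", values_to = "value") %>% # long form
  left_join(df2, by = "Name") %>%                                        # add date limits
  mutate(value = if_else(is.na(Start) | (Date >= Start & Date <= End),
                         value, NA_real_)) %>%                           # blank values outside the interval
  select(Date, Name, value) %>%                                          # drop helper columns
  pivot_wider(names_from = Name, values_from = value)                    # back to wide form
```

The `is.na(Start)` check makes the handling of products without an interval explicit instead of relying on how `replace()` treats NA indices.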