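I have two data frames, df1 and df2. df1 contains the values of different products (X1, X2, etc.) at different times. df2 contains the true start and end dates for some of those products. I want to replace the values in df1 that fall outside the date intervals given in df2 with NA, as shown in the final table df3.
Creating df1 and df2: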
df1=data.frame(matrix(NA,10,6))
df1[,1]=(c(seq(as.Date("2012-01-01"),as.Date("2012-10-01"),by="1 month")))
df1[,2]=c(1:10); df1[,3]=c(12:21); df1[,4]=c(0.5:10); df1[,5]=c(5:14); df1[,6]=c(10:19)
colnames(df1)=c("Date","X1","X2","X3","X4","X5")
df2=data.frame(matrix(data=c("X1","X2","X4","2012-02-01","2012-04-01","2012-06-01","2012-09-01","2012-06-01","2012-10-01"),3,3))
colnames(df2)=c("Name","Start","End")
Output:
> df1
         Date X1 X2  X3 X4 X5
1  2012-01-01  1 12 0.5  5 10
2  2012-02-01  2 13 1.5  6 11
3  2012-03-01  3 14 2.5  7 12
4  2012-04-01  4 15 3.5  8 13
5  2012-05-01  5 16 4.5  9 14
6  2012-06-01  6 17 5.5 10 15
7  2012-07-01  7 18 6.5 11 16
8  2012-08-01  8 19 7.5 12 17
9  2012-09-01  9 20 8.5 13 18
10 2012-10-01 10 21 9.5 14 19
> df2
  Name      Start        End
1   X1 2012-02-01 2012-09-01
2   X2 2012-04-01 2012-06-01
3   X4 2012-06-01 2012-10-01
The final output should look like this:
df3
         Date X1 X2  X3 X4 X5
1  2012-01-01 NA NA 0.5 NA 10
2  2012-02-01  2 NA 1.5 NA 11
3  2012-03-01  3 NA 2.5 NA 12
4  2012-04-01  4 15 3.5 NA 13
5  2012-05-01  5 16 4.5 NA 14
6  2012-06-01  6 17 5.5 10 15
7  2012-07-01  7 NA 6.5 11 16
8  2012-08-01  8 NA 7.5 12 17
9  2012-09-01  9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19
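Answer 0: (score: 1)
I'm sure there are more elegant ways, but you could create a matrix of indices that meet your condition, with an element set to 1 if it lies inside the interval of its product and to NA if it does not. Assuming you are dealing with numeric values, you can then simply multiply your data frame by that index matrix:
Example: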
library(dplyr)
## Convert your dates to Date objects:
df2 <- df2 %>% dplyr::mutate(Start = as.Date(Start), End = as.Date(End))
## Create a logical matrix: TRUE where the date lies inside the product's
## interval. Products without an entry in df2 are kept in full (all TRUE):
indMx <- sapply(names(df1)[-1], function(product){
  i <- match(product, df2$Name)
  if (is.na(i)) return(rep(TRUE, nrow(df1)))
  (df1$Date >= df2$Start[i]) & (df1$Date <= df2$End[i])
})
## NA^0 is 1 and NA^1 is NA, so NA^(!indMx) gives 1 in place of TRUE
## (inside the interval) and NA in place of FALSE (outside):
df1[,-1] <- df1[,-1]*NA^(!indMx)
## Result for the df1 and df2 built in the question:
df1
#          Date X1 X2  X3 X4 X5
# 1  2012-01-01 NA NA 0.5 NA 10
# 2  2012-02-01  2 NA 1.5 NA 11
# 3  2012-03-01  3 NA 2.5 NA 12
# 4  2012-04-01  4 15 3.5 NA 13
# 5  2012-05-01  5 16 4.5 NA 14
# 6  2012-06-01  6 17 5.5 10 15
# 7  2012-07-01  7 NA 6.5 11 16
# 8  2012-08-01  8 NA 7.5 12 17
# 9  2012-09-01  9 NA 8.5 13 18
# 10 2012-10-01 NA NA 9.5 14 19
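The multiplication trick rests on how R evaluates powers of NA: anything raised to the power 0 is 1, so NA^FALSE is 1, while NA^TRUE stays NA. A minimal sketch on a made-up vector (the names x, keep and mask are illustrative and do not appear in the post):

```r
x    <- c(10, 20, 30)
keep <- c(TRUE, FALSE, TRUE)   # TRUE = value should survive

# !keep is FALSE where we keep a value, so NA^(!keep) is 1 there and NA elsewhere
mask <- NA^(!keep)

masked <- x * mask             # kept values unchanged, the rest become NA
masked
```

Because the mask is built by exponentiation, this only makes sense for numeric columns; character or factor columns would need a different replacement step.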
Answer 1: (score: 1)
Here is a solution with data.table. There may be a more elegant approach using a non-equi join.
for(i in seq_len(nrow(df2))) df1[!(Date %between% df2[i,.(Start, End)]), df2[i, Name] := NA]
Here, you loop through the rows of df2, subset df1 to the dates that fall outside the start and end dates in the current row of df2, and then assign NA to the variable named in that row of df2.
This returns
df1
          Date X1 X2  X3
 1: 2012-01-01 NA NA  NA
 2: 2012-02-01  2 NA  NA
 3: 2012-03-01  3 NA  NA
 4: 2012-04-01  4 15  NA
 5: 2012-05-01  5 16  NA
 6: 2012-06-01  6 17 5.5
 7: 2012-07-01  7 NA 6.5
 8: 2012-08-01  8 NA 7.5
 9: 2012-09-01  9 NA 8.5
10: 2012-10-01 NA NA 9.5
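Update
If the data are constructed as in the update to the original post, first run these lines to convert the Name variable in df2 to a character vector (it starts out as a factor). The code above will then work with the new data set.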
# convert data.frames to data.tables
setDT(df1)
setDT(df2)
# convert factor to character
df2[, Name := as.character(Name)]
Data
library(data.table)
# read in data
df1 <- fread("Date X1 X2 X3
2012-01-01 1 12 0.5
2012-02-01 2 13 1.5
2012-03-01 3 14 2.5
2012-04-01 4 15 3.5
2012-05-01 5 16 4.5
2012-06-01 6 17 5.5
2012-07-01 7 18 6.5
2012-08-01 8 19 7.5
2012-09-01 9 20 8.5
2012-10-01 10 21 9.5")
df2 <- fread(" Name Start End
X1 2012-02-01 2012-09-01
X2 2012-04-01 2012-06-01
X3 2012-06-01 2012-10-01")
# convert to date type
df1[, Date := as.Date(Date)]
df2[, c("Start", "End") := .(as.Date(Start), as.Date(End))]
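The non-equi join mentioned at the top of this answer could look roughly like the sketch below. It is one possible reading, not the answer author's code: the data are reshaped to long form first, and the column name inside and the objects long and res are made up for illustration. The toy data mirror the example above, with all values stored as numeric so that melt() does not have to reconcile column types.

```r
library(data.table)

df1 <- data.table(
  Date = seq(as.Date("2012-01-01"), by = "1 month", length.out = 10),
  X1 = as.numeric(1:10),
  X2 = as.numeric(12:21),
  X3 = seq(0.5, 9.5, by = 1)
)
df2 <- data.table(
  Name  = c("X1", "X2", "X3"),
  Start = as.Date(c("2012-02-01", "2012-04-01", "2012-06-01")),
  End   = as.Date(c("2012-09-01", "2012-06-01", "2012-10-01"))
)

# Long form: one row per (Date, product)
long <- melt(df1, id.vars = "Date", variable.name = "Name",
             value.name = "value", variable.factor = FALSE)

# Non-equi update join: flag rows whose Date lies within [Start, End]
long[, inside := FALSE]
long[df2, on = .(Name, Date >= Start, Date <= End), inside := TRUE]

# Set values outside the interval to NA, only for products listed in df2
long[Name %in% df2$Name & !inside, value := NA_real_]

# Back to wide form
res <- dcast(long, Date ~ Name, value.var = "value")
res
```

Compared with the loop, this avoids iterating over df2 row by row, at the cost of a melt/dcast round trip; for a handful of products the loop is probably just as practical.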
Answer 2: (score: 1)
Using dplyr and tidyr...
library(tidyr)
library(dplyr)
df3 <- df1 %>% gather(key=Name,value=value,-Date) %>% #convert to long form
left_join(df2) %>% #merge in date limits
mutate(ind=(as.Date(Date)>=as.Date(Start) & as.Date(Date)<=as.Date(End))) %>% #check valid
mutate(value=replace(value,!ind,NA)) %>% #replace invalid with NA
select(Date,Name,value) %>% #remove unnecessary variables
spread(key=Name,value=value) #convert back to rectangular form
df3
         Date X1 X2  X3 X4 X5
1  2012-01-01 NA NA 0.5 NA 10
2  2012-02-01  2 NA 1.5 NA 11
3  2012-03-01  3 NA 2.5 NA 12
4  2012-04-01  4 15 3.5 NA 13
5  2012-05-01  5 16 4.5 NA 14
6  2012-06-01  6 17 5.5 10 15
7  2012-07-01  7 NA 6.5 11 16
8  2012-08-01  8 NA 7.5 12 17
9  2012-09-01  9 NA 8.5 13 18
10 2012-10-01 NA NA 9.5 14 19
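In current tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same long/join/wide idea can be sketched with those functions as follows. This is an adaptation rather than the answer's original code: the helper column outside is made up for illustration, and the data are rebuilt here with numeric values and real Date columns so the block runs on its own.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  Date = seq(as.Date("2012-01-01"), by = "1 month", length.out = 10),
  X1 = as.numeric(1:10),
  X2 = as.numeric(12:21),
  X3 = seq(0.5, 9.5, by = 1),
  X4 = as.numeric(5:14),
  X5 = as.numeric(10:19)
)
df2 <- data.frame(
  Name  = c("X1", "X2", "X4"),
  Start = as.Date(c("2012-02-01", "2012-04-01", "2012-06-01")),
  End   = as.Date(c("2012-09-01", "2012-06-01", "2012-10-01")),
  stringsAsFactors = FALSE
)

df3 <- df1 %>%
  pivot_longer(-Date, names_to = "Name", values_to = "value") %>%  # long form
  left_join(df2, by = "Name") %>%                                  # add date limits
  # products missing from df2 have Start = NA and are never marked as outside
  mutate(outside = !is.na(Start) & (Date < Start | Date > End),
         value   = replace(value, outside, NA)) %>%
  select(Date, Name, value) %>%
  pivot_wider(names_from = Name, values_from = value)              # wide form
df3
```

Products without a row in df2 (X3 and X5 here) pass through unchanged, which matches the df3 shown in the question.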