Mutate a column based on another data frame with multiple matches

Time: 2017-09-06 16:08:49

Tags: r dplyr

I have the following data frames.

df1

ID    PMT_DATE
100   2015/01/01
100   2015/02/01
100   2015/04/01 
200   2016/01/01
200   2016/02/01

and df2

ID    DATE         STATUS    
100   2014/12/31   A
100   2015/03/15   B
200   2015/12/31   A  
200   2016/06/01   C

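I want to build a new data frame from df1 using the STATUS column of df2. If PMT_DATE in df1 is greater than or equal to DATE in df2, the associated STATUS from df2 should be placed in the new data frame. The resulting data frame should look like this: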

ID    PMT_DATE     STATUS
100   2015/01/01   A
100   2015/02/01   A
100   2015/04/01   B 
200   2016/01/01   A
200   2016/02/01   A
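
To make the example reproducible, the two inputs and the desired result can be written out in R roughly as follows. This is only a sketch: the column types are an assumption (ID as integer, dates as character strings, which is how they are printed above), and the name `expected` is introduced here for illustration.

```r
# Inputs as printed in the question; dates are kept as character strings (assumption).
df1 <- data.frame(
  ID = c(100L, 100L, 100L, 200L, 200L),
  PMT_DATE = c("2015/01/01", "2015/02/01", "2015/04/01", "2016/01/01", "2016/02/01"),
  stringsAsFactors = FALSE
)

df2 <- data.frame(
  ID = c(100L, 100L, 200L, 200L),
  DATE = c("2014/12/31", "2015/03/15", "2015/12/31", "2016/06/01"),
  STATUS = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Desired result: df1 plus the latest STATUS whose DATE is on or before PMT_DATE.
# `expected` is a hypothetical name used only for this illustration.
expected <- df1
expected$STATUS <- c("A", "A", "B", "A", "A")
```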

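Normally I would join the two tables, create a new column with mutate to do the computation, and drop the columns I no longer need. But because the ID columns in df1 and df2 have multiple matches, I haven't been able to make that strategy work.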

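Edit: when there are multiple matches, I want the most recent status. For example, the last row for ID == 100 would qualify for both STATUS == A and STATUS == B, but I only want status B. Also, the ID fields in the two data frames represent the same thing (i.e. a join on ID is what I want).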

I was thinking of something along the lines of

new_df <- df1 %>% rowwise() %>% do() ...

but I don't know how to fill in the rest to get what I need.

2 answers:

Answer 0: (score: 2)

I'm not sure whether a rolling join is available in dplyr. Here is how I would use data.table to get the latest STATUS:

library(data.table)
setDT(df2)[setDT(df1), on = .(ID, DATE = PMT_DATE), roll = Inf]

Output

    ID       DATE STATUS
1: 100 2015-01-01      A
2: 100 2015-02-01      A
3: 100 2015-04-01      B
4: 200 2016-01-01      A
5: 200 2016-02-01      A
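
For readers who want to run the rolling join end to end, here is a minimal self-contained sketch. It assumes the date columns are converted to Date first (the output above prints them in Date format, but the answer does not show that step), and the name `res` and the final renaming step are additions for illustration only.

```r
library(data.table)

# Inputs from the question, with dates parsed to Date (assumption).
df1 <- data.table(
  ID = c(100L, 100L, 100L, 200L, 200L),
  PMT_DATE = as.Date(c("2015/01/01", "2015/02/01", "2015/04/01",
                       "2016/01/01", "2016/02/01"), format = "%Y/%m/%d")
)
df2 <- data.table(
  ID = c(100L, 100L, 200L, 200L),
  DATE = as.Date(c("2014/12/31", "2015/03/15", "2015/12/31", "2016/06/01"),
                 format = "%Y/%m/%d"),
  STATUS = c("A", "B", "A", "C")
)

# For each row of df1, match on ID and roll the most recent df2 row
# (DATE on or before PMT_DATE) forward onto it.
res <- df2[df1, on = .(ID, DATE = PMT_DATE), roll = Inf]

# The join column keeps df2's name but holds df1's PMT_DATE values,
# so rename it to match the layout asked for in the question.
setnames(res, "DATE", "PMT_DATE")
```

With `roll = Inf`, the last observation is carried forward without any limit on the gap. A df1 row with no earlier df2 entry for its ID would get NA in STATUS.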

Answer 1: (score: 0)

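Here is an approach using dplyr. The idea is to join the tables first and then keep only the rows where PMT_DATE >= DATE. I assume you only want the latest STATUS (the STATUS associated with the latest DATE).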

library(dplyr)

df1 %>% 
  left_join(df2, by="ID") %>%
  filter(PMT_DATE >= DATE) %>% 
  group_by(ID, PMT_DATE) %>% 
  slice(n()) %>% # last row per group = latest status (assumes df2 is sorted by DATE within each ID)
  select(-DATE) %>%
  ungroup()

# # A tibble: 5 x 3
#      ID   PMT_DATE STATUS
#   <int>      <chr>  <chr>
# 1   100 2015/01/01      A
# 2   100 2015/02/01      A
# 3   100 2015/04/01      B
# 4   200 2016/01/01      A
# 5   200 2016/02/01      A
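
As a possible alternative to the join-then-filter pattern above, newer dplyr versions can express the rolling match directly. The sketch below assumes dplyr 1.1.0 or later, where `join_by()` and `closest()` are available, and assumes the dates are parsed to Date. It is not part of either answer, and `res` is a name chosen here for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for join_by() and closest()

# Inputs from the question, with dates parsed to Date (assumption).
df1 <- tibble(
  ID = c(100L, 100L, 100L, 200L, 200L),
  PMT_DATE = as.Date(c("2015/01/01", "2015/02/01", "2015/04/01",
                       "2016/01/01", "2016/02/01"), format = "%Y/%m/%d")
)
df2 <- tibble(
  ID = c(100L, 100L, 200L, 200L),
  DATE = as.Date(c("2014/12/31", "2015/03/15", "2015/12/31", "2016/06/01"),
                 format = "%Y/%m/%d"),
  STATUS = c("A", "B", "A", "C")
)

# For each df1 row, match on ID and pick the single df2 row whose DATE
# is the closest one on or before PMT_DATE.
res <- df1 %>%
  left_join(df2, by = join_by(ID, closest(PMT_DATE >= DATE))) %>%
  select(ID, PMT_DATE, STATUS)
```

This version does not depend on the row order of df2. Because it is a left join, df1 rows without any earlier df2 entry are kept with NA in STATUS, whereas the filter-based pipeline drops them.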