I have the following data frames:
df1
ID PMT_DATE
100 2015/01/01
100 2015/02/01
100 2015/04/01
200 2016/01/01
200 2016/02/01
and df2
ID DATE STATUS
100 2014/12/31 A
100 2015/03/15 B
200 2015/12/31 A
200 2016/06/01 C
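I want to build a new data frame from df1 using the STATUS column in df2. If PMT_DATE in df1 is greater than or equal to DATE in df2, the associated status from df2 should go into the new data frame. The result should look like this: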
ID PMT_DATE STATUS
100 2015/01/01 A
100 2015/02/01 A
100 2015/04/01 B
200 2016/01/01 A
200 2016/02/01 A
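Normally I would join the two tables, create a new column with mutate to do the calculation, and drop the columns I no longer need. But because the ID column has multiple matches between df1 and df2, I can't quite carry out that strategy.
Edit: when there are multiple matches, I want the latest status. For example, the last row for ID == 100 would match both Status == A and Status == B, but I only want Status B. Also, the ID field in both data frames represents the same thing (i.e. a join on ID is what I want).
I was thinking of something along the lines of
new_df <- df1 %>% rowwise() %>% do() ...
but I don't know how to fill in the rest to get what I need.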
Answer 0 (score: 2)
I'm not sure whether a rolling join is available in dplyr. Here is how I would use data.table to get the latest STATUS:
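library(data.table)
# Rolling join: for each df1 row, take the df2 row with the same ID and the
# most recent DATE on or before PMT_DATE (roll = Inf carries the last value forward).
# The output below prints dates as Date values, so DATE and PMT_DATE are
# presumably converted with as.Date() beforehand.
setDT(df2)[setDT(df1), on = .(ID, DATE = PMT_DATE), roll = Inf]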
ID DATE STATUS
1: 100 2015-01-01 A
2: 100 2015-02-01 A
3: 100 2015-04-01 B
4: 200 2016-01-01 A
5: 200 2016-02-01 A
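To try the rolling join on the question's data, here is a minimal self-contained sketch. It assumes the date columns are converted with as.Date(), which the hyphenated dates in the output above suggest but the answer does not state. The name res is made up for illustration.

```r
library(data.table)

# Sample data from the question; parsing the dates as Date is an assumption
df1 <- data.table(
  ID = c(100, 100, 100, 200, 200),
  PMT_DATE = as.Date(c("2015/01/01", "2015/02/01", "2015/04/01",
                       "2016/01/01", "2016/02/01"), format = "%Y/%m/%d")
)
df2 <- data.table(
  ID = c(100, 100, 200, 200),
  DATE = as.Date(c("2014/12/31", "2015/03/15", "2015/12/31", "2016/06/01"),
                 format = "%Y/%m/%d"),
  STATUS = c("A", "B", "A", "C")
)

# For each row of df1, roll forward the last df2 row with the same ID
# whose DATE is on or before PMT_DATE
res <- df2[df1, on = .(ID, DATE = PMT_DATE), roll = Inf]
print(res)
```

Note that the DATE column in the result holds the PMT_DATE values from df1, not the matched dates from df2.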
Answer 1 (score: 0)
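Here is an approach using dplyr. The idea is to join the tables first and then keep only the rows where PMT_DATE >= DATE. I assume you only want the latest STATUS (the STATUS associated with the latest DATE).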
library(dplyr)
df1 %>%
left_join(df2, by="ID") %>%
filter(PMT_DATE >= DATE) %>%
group_by(ID, PMT_DATE) %>%
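arrange(DATE, .by_group = TRUE) %>% # order each group by DATE so the latest is last
slice(n()) %>% # keep the last row, i.e. the latest status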
select(-DATE) %>%
ungroup()
# # A tibble: 5 x 3
# ID PMT_DATE STATUS
# <int> <chr> <chr>
# 1 100 2015/01/01 A
# 2 100 2015/02/01 A
# 3 100 2015/04/01 B
# 4 200 2016/01/01 A
# 5 200 2016/02/01 A
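As a side note on rolling joins in dplyr: versions 1.1.0 and later provide join_by() with a closest() helper, which can express the same "latest DATE on or before PMT_DATE" match directly in the join. The sketch below assumes such a dplyr version is installed and that the dates are parsed as Date. The name new_df is taken from the question.

```r
library(dplyr)

# Sample data from the question; parsing the dates as Date is an assumption
df1 <- tibble(
  ID = c(100, 100, 100, 200, 200),
  PMT_DATE = as.Date(c("2015/01/01", "2015/02/01", "2015/04/01",
                       "2016/01/01", "2016/02/01"), format = "%Y/%m/%d")
)
df2 <- tibble(
  ID = c(100, 100, 200, 200),
  DATE = as.Date(c("2014/12/31", "2015/03/15", "2015/12/31", "2016/06/01"),
                 format = "%Y/%m/%d"),
  STATUS = c("A", "B", "A", "C")
)

# Match on ID, then keep only the closest df2 row with DATE <= PMT_DATE
new_df <- df1 %>%
  left_join(df2, by = join_by(ID, closest(PMT_DATE >= DATE))) %>%
  select(-DATE)
print(new_df)
```

Compared with joining on ID and filtering afterwards, this avoids building every ID-level combination first, and it does not depend on the row order of df2.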