Return a subset based on a logical comparison with an input df

Time: 2019-07-03 02:01:50

Tags: r

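Using R, I am trying to return one row per id, filtered by the target year in df_filter or, failing that, by the next lowest integer below the filter value.

Original df: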

df:
id   year       
1    2019    
1    2018   
1    2005   
1    2004    
2    2018   
2    2017   
3    1998  
3    1997
3    1996
3    1995

Filter:

df_filter:
id   year       
1    2017  
2    2018
3    2000

The resulting data frame should look like this:

dfnew:
id   year
1    2005
2    2017
3    1998

2 answers:

Answer 0 (score: 2)

Using dplyr, we can left_join df and df_filter by id, group_by id, arrange year in descending order, and select the first row where the difference between the two years is less than 0.
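
The steps described above could be sketched roughly as follows. This is a reconstruction under assumptions, not necessarily the answerer's own code: the column name `target` and the `rename()` step are illustrative choices made here to keep the two year columns apart after the join.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id   = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  year = c(2019L, 2018L, 2005L, 2004L, 2018L, 2017L, 1998L, 1997L, 1996L, 1995L)
)
df_filter <- data.frame(id = 1:3, year = c(2017L, 2018L, 2000L))

dfnew <- df %>%
  # 'target' is a hypothetical name for the filter year
  left_join(rename(df_filter, target = year), by = "id") %>%
  group_by(id) %>%
  # newest year first within each id
  arrange(desc(year), .by_group = TRUE) %>%
  # keep only years strictly below the target (difference < 0)
  filter(year - target < 0) %>%
  # first remaining row per id is the closest year below the target
  slice(1) %>%
  ungroup() %>%
  select(id, year)

dfnew
```

With the example data this should give one row per id, matching the dfnew shown in the question.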

Answer 1 (score: 2)

We can use data.table

library(data.table)
setDT(df)[df_filter, on = .(id)][year != i.year, 
      .(year = year[which(year  < i.year)[1]]), id]
#   id year
#1:  1 2005
#2:  2 2017
#3:  3 1998
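
Because the snippet above takes the first matching row, it appears to rely on df already being sorted by descending year within each id, as it is in the example data. A minimal sketch of a variant that does not depend on row order, assuming the goal is the largest year strictly below the target (`res` is an illustrative name):

```r
library(data.table)

# Example data from the question
df <- data.table(
  id   = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  year = c(2019L, 2018L, 2005L, 2004L, 2018L, 2017L, 1998L, 1997L, 1996L, 1995L)
)
df_filter <- data.table(id = 1:3, year = c(2017L, 2018L, 2000L))

# Join on id (the filter's year arrives as the column i.year),
# keep rows strictly below the target, then take the max per id
res <- df[df_filter, on = .(id)][year < i.year, .(year = max(year)), by = id]

res
```

Using `max()` instead of positional indexing trades a small amount of work for independence from the input ordering.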

Or use a non-equi join

setDT(df)[, year1 := year][df_filter, .(id, year), 
         on = .(id, year1 < year), mult = 'first']
#    id year
#1:  1 2005
#2:  2 2017
#3:  3 1998

Or without assigning by reference (:=) in the original dataset

setDT(df)[, .(year1 = year, year, id)][df_filter, .(id, year),
        on = .(id, year1 < year), mult = 'first']

Or, as @thelatemail commented

setDT(df)[df_filter, on=.(id, year < year), .(yearM = max(x.year)),
          by=.EACHI][, .(id, year = yearM)]

Or use tidyverse together with fuzzyjoin

library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(df, df_filter, by = c("id", "year"),
       match_fun = list(`==`, `<`)) %>% 
  group_by(id = id.x) %>%
  summarise(year = year.x[which(year.x < year.y)[1]])
# A tibble: 3 x 2
#     id year
#  <int> <int>
#1     1  2005
#2     2  2017
#3     3  1998

Data

df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), 
    year = c(2019L, 2018L, 2005L, 2004L, 2018L, 2017L, 1998L, 
    1997L, 1996L, 1995L)), class = "data.frame", row.names = c(NA, 
-10L))
df_filter <- structure(list(id = 1:3, year = c(2017L, 2018L, 2000L)), 
   class = "data.frame", row.names = c(NA, -3L))