Using R, I am trying to return one row per id, filtered by the target year in df_filter, or by the next-lowest year below that target.
Original df:
df:
id year
1 2019
1 2018
1 2005
1 2004
2 2018
2 2017
3 1998
3 1997
3 1996
3 1995
Filter:
df_filter:
id year
1 2017
2 2018
3 2000
The resulting data frame should look like this:
dfnew:
id year
1 2005
2 2017
3 1998
Answer 0 (score: 2)
Using dplyr, we can left_join df and df_filter by id, group_by id, arrange year in descending order, and select the first row where the difference between the two years is less than 0.
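The code for this answer did not survive in the text above, so what follows is only a minimal sketch of the steps it describes, not the answerer's original code. It assumes the df and df_filter from the question; the column name target and the use of filter() plus slice(1) to pick the first qualifying row are illustrative choices.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  id   = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  year = c(2019L, 2018L, 2005L, 2004L, 2018L, 2017L,
           1998L, 1997L, 1996L, 1995L)
)
df_filter <- data.frame(id = 1:3, year = c(2017L, 2018L, 2000L))

dfnew <- df %>%
  # rename the filter year first so the join does not create year.x / year.y
  # ("target" is a hypothetical name chosen for this sketch)
  left_join(rename(df_filter, target = year), by = "id") %>%
  group_by(id) %>%
  # newest year first within each id
  arrange(desc(year), .by_group = TRUE) %>%
  # keep only years strictly below the target, then take the first (largest) one
  filter(year - target < 0) %>%
  slice(1) %>%
  ungroup() %>%
  select(id, year)

dfnew
```

On the sample data this should give 2005, 2017 and 1998 for ids 1, 2 and 3, matching the dfnew requested in the question. An id with no year below its target would be dropped by this sketch rather than returned as NA.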
Answer 1 (score: 2)
We can use data.table
library(data.table)
setDT(df)[df_filter, on = .(id)][year != i.year,
          .(year = year[which(year < i.year)[1]]), id]
# id year
#1: 1 2005
#2: 2 2017
#3: 3 1998
Or use a non-equi join
setDT(df)[, year1 := year][df_filter, .(id, year),
          on = .(id, year1 < year), mult = 'first']
# id year
#1: 1 2005
#2: 2 2017
#3: 3 1998
Or without assigning (:=) in the original dataset
setDT(df)[, .(year1 = year, year, id)][df_filter, .(id, year),
          on = .(id, year1 < year), mult = 'first']
Or, as @thelatemail commented
setDT(df)[df_filter, on=.(id, year < year), .(yearM = max(x.year)),
          by=.EACHI][, .(id, year = yearM)]
Or use tidyverse together with fuzzyjoin
library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(df, df_filter, by = c("id", "year"),
                match_fun = list(`==`, `<`)) %>%
  group_by(id = id.x) %>%
  summarise(year = year.x[which(year.x < year.y)[1]])
# A tibble: 3 x 2
# id year
# <int> <int>
#1 1 2005
#2 2 2017
#3 3 1998
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
year = c(2019L, 2018L, 2005L, 2004L, 2018L, 2017L, 1998L,
1997L, 1996L, 1995L)), class = "data.frame", row.names = c(NA,
-10L))
df_filter <- structure(list(id = 1:3, year = c(2017L, 2018L, 2000L)),
class = "data.frame", row.names = c(NA, -3L))