So I have the following tables:
Tab1:
Variable timestamp
s1 1053093896
s2 1053095216
s1 1053181616
s1 1053959216
s2 1054132016
and Tab2:
Variable timestamp
s1 1053129600
s2 1053820800
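For each variable, I want to extract the first row of Tab1 whose timestamp is greater than that variable's timestamp in Tab2. The result I am looking for is: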
Variable timestamp
s1 1053181616
s2 1054132016
Answer 0: (score: 2)
Here is one approach using the dplyr package. I modified the numbers to improve readability.
df1 <- data.frame(variable = c("s1", "s2", "s1", "s1", "s2"),
                  timestamp = 1:5, stringsAsFactors = FALSE)
df2 <- data.frame(variable = c("s1", "s2"),
                  timestamp = c(2, 4), stringsAsFactors = FALSE)
> df1
variable timestamp
1 s1 1
2 s2 2
3 s1 3
4 s1 4
5 s2 5
> df2
variable timestamp
1 s1 2
2 s2 4
library(dplyr)
df1 %>% left_join(df2, by = "variable", suffix = c("", "_2")) %>%
filter(timestamp > timestamp_2) %>%
group_by(variable) %>%
slice(1) %>%
select(-timestamp_2)
# A tibble: 2 x 2
# Groups: variable [2]
variable timestamp
<chr> <int>
1 s1 3
2 s2 5
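The pipeline above uses slice(1), which takes the first row of each group in the order the rows appear in df1. If df1 might not be sorted by timestamp, one possible variant is sketched below. It assumes dplyr 1.0.0 or later (for slice_min()) and assumes that "first" means the smallest qualifying timestamp; the unsorted df1 here is made up for illustration.

```r
library(dplyr)

# Illustrative data: same shape as above, but df1 is deliberately unsorted
df1 <- data.frame(variable = c("s1", "s2", "s1", "s1", "s2"),
                  timestamp = c(4L, 2L, 3L, 1L, 5L))
df2 <- data.frame(variable = c("s1", "s2"),
                  timestamp = c(2, 4))

res <- df1 %>%
  inner_join(df2, by = "variable", suffix = c("", "_2")) %>%  # attach each variable's cutoff
  filter(timestamp > timestamp_2) %>%                          # keep rows after the cutoff
  group_by(variable) %>%
  slice_min(timestamp, n = 1, with_ties = FALSE) %>%           # smallest remaining timestamp
  ungroup() %>%
  select(-timestamp_2)

res
```

Because slice_min() orders by timestamp itself, the result does not depend on how df1 happens to be sorted.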
Answer 1: (score: 2)
This can be done with a left join on the indicated logical condition:
library(sqldf)
sqldf("select b.Variable, min(a.timestamp) as timestamp
from tab2 b
left join tab1 a on a.Variable = b.Variable and a.timestamp > b.timestamp
group by b.Variable")
giving:
Variable timestamp
1 s1 1053181616
2 s2 1054132016
The input, in reproducible form:
Lines1 <- "Variable timestamp
s1 1053093896
s2 1053095216
s1 1053181616
s1 1053959216
s2 1054132016"
tab1 <- read.table(text = Lines1, header = TRUE, strip.white = TRUE)
Lines2 <- "Variable timestamp
s1 1053129600
s2 1053820800"
tab2 <- read.table(text = Lines2, header = TRUE, strip.white = TRUE)
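For readers without sqldf, roughly the same logic can be sketched in base R with merge() and aggregate(). This is an illustration rather than a translation of the query: the column name cutoff is introduced here only to avoid a name clash, and unlike the left join, this version drops any variable that has no later timestamp instead of returning a missing value for it.

```r
# Same data as in the question, built directly as data frames
tab1 <- data.frame(
  Variable  = c("s1", "s2", "s1", "s1", "s2"),
  timestamp = c(1053093896, 1053095216, 1053181616, 1053959216, 1054132016)
)
tab2 <- data.frame(
  Variable  = c("s1", "s2"),
  timestamp = c(1053129600, 1053820800)
)

# Rename tab2's timestamp to a hypothetical 'cutoff' column before joining
thresholds <- setNames(tab2, c("Variable", "cutoff"))

joined <- merge(tab1, thresholds, by = "Variable")      # equi-join on Variable
later  <- joined[joined$timestamp > joined$cutoff, ]    # keep rows after the cutoff
res    <- aggregate(timestamp ~ Variable, data = later, FUN = min)

res
```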
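Answer 2: (score: 1)
A non-join/merge solution is to go through the Variable and timestamp values of tab2 with Map, filter tab1 on the condition, select the first matching row, and rbind the list of rows.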
do.call(rbind,
Map(function(x, y) tab1[with(tab1, which.max(Variable == x & timestamp > y)), ],
tab2$Variable, tab2$timestamp))
# Variable timestamp
#3 s1 1053181616
#5 s2 1054132016
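One thing to be aware of: which.max() on an all-FALSE vector returns 1, so if some variable in tab2 has no later timestamp in tab1, the code above would return the first row of tab1 for it. A guarded variant might look like the sketch below. The helper name first_after and the extra s3 row are hypothetical, added only to show the no-match case.

```r
tab1 <- data.frame(
  Variable  = c("s1", "s2", "s1", "s1", "s2"),
  timestamp = c(1053093896, 1053095216, 1053181616, 1053959216, 1054132016)
)
# 's3' is a made-up variable with no matching rows in tab1
tab2 <- data.frame(
  Variable  = c("s1", "s2", "s3"),
  timestamp = c(1053129600, 1053820800, 1060000000)
)

# Hypothetical helper: first row of tab1 for variable x after time y, or NULL
first_after <- function(x, y) {
  hits <- which(tab1$Variable == x & tab1$timestamp > y)
  if (length(hits) == 0) return(NULL)   # no match: contribute nothing
  tab1[hits[1], ]
}

# unname() so the list names are not passed to rbind as argument names
res <- do.call(rbind, unname(Map(first_after, tab2$Variable, tab2$timestamp)))

res
```

rbind() skips the NULL entries, so variables without a match simply do not appear in the result.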
Answer 3: (score: 0)
You can do this in a data.table join by ordering the data and using the mult = 'first' option.
library(data.table)
# convert to data tables
setDT(tab1)
setDT(tab2)
# order data (unnecessary if already ordered)
setorder(tab1, timestamp)
setorder(tab2, timestamp)
tab1[tab2, on = .(Variable, timestamp > timestamp), mult = 'first',
.(Variable, x.timestamp)]
# Variable x.timestamp
# 1: s1 1053181616
# 2: s2 1054132016
Data used
tab1 <- fread('
Variable timestamp
s1 1053093896
s2 1053095216
s1 1053181616
s1 1053959216
s2 1054132016
')
tab2 <- fread('
Variable timestamp
s1 1053129600
s2 1053820800
')