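I merged two datasets, and the relationship between them is not one-to-one. I now need to determine the best match (between rows) based on the time delay.
I have tried this in both MySQL and R, but could not find a way to do it.
My initial data looks like this: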
data <- data.frame("sent_id" = c(1,1,2,2,3,3,3,4,4,4),
"recieved_id" = c(100,101,100,101,105,106,107,105,106,107),
"delay" = c('00:00:00','15:00:00','-00:14:59','00:00:01','23:00:05','00:01:00',
'-18:00:00','15:00:00','23:00:00','00:30:10'))
In the end I would like to get something like this:
data2 <- data.frame("sent_id" = c(1,1,2,2,3,3,3,4,4,4),
"recieved_id" = c(100,101,100,101,105,106,107,105,106,107),
"delay" = c('00:00:00','15:00:00','-00:14:59','00:00:01','23:00:05','00:01:00',
'-18:00:00','15:00:00','23:00:00','00:30:10'),
'best_match' = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE))
Answer 0 (score: 0)
I removed the negative signs from the delay values and did the following.
test_data <- data.frame("sent_id" = c(1,1,2,2,3,3,3,4,4,4),
"recieved_id" = c(100,101,100,101,105,106,107,105,106,107),
"delay" = c('00:00:00','15:00:00','00:14:59','00:00:01','23:00:05','00:01:00',
'18:00:00','15:00:00','23:00:00','00:30:10'))
received_id <- unique(test_data$recieved_id)
sent_id_2 <- unique(test_data$sent_id)
library(dplyr)
You will need to clean this up into actual code, but it should get you there.
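One way the idea above might be finished is sketched below. It assumes that "best match" means the row with the smallest absolute delay within each sent_id, which is consistent with the expected best_match column in the question but is not stated there explicitly. It uses base R (ave) rather than dplyr so that it runs without extra packages; delay_to_abs_seconds and abs_delay are hypothetical names introduced only for this illustration.

```r
# Sample data from the question (column name "recieved_id" kept as spelled there).
data <- data.frame(
  sent_id     = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
  recieved_id = c(100, 101, 100, 101, 105, 106, 107, 105, 106, 107),
  delay       = c("00:00:00", "15:00:00", "-00:14:59", "00:00:01", "23:00:05",
                  "00:01:00", "-18:00:00", "15:00:00", "23:00:00", "00:30:10"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "[-]HH:MM:SS" strings to absolute seconds.
# The leading minus sign is dropped, mirroring the "remove the negative sign" step.
delay_to_abs_seconds <- function(x) {
  parts <- strsplit(sub("^-", "", x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

data$abs_delay <- delay_to_abs_seconds(data$delay)

# Assumption: the best match is the smallest absolute delay within each sent_id.
# ave() returns the per-group minimum repeated for every row of that group.
data$best_match <- data$abs_delay == ave(data$abs_delay, data$sent_id, FUN = min)

print(data[, c("sent_id", "recieved_id", "delay", "best_match")])
```

Note that this sketch does not enforce that each recieved_id is used only once; on the sample data the per-sender minima happen to pick distinct receivers, but a tie or a shared receiver would need an extra rule. Rows with equal minimum delays within a sent_id would all be flagged TRUE.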