Question

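I merged two datasets whose rows are not in a one-to-one relationship. I now need to determine the best match between rows based on the time delay.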

I have tried this in both MySQL and R, but could not find a way to do it.

My initial data looks like this:

data <- data.frame("sent_id" = c(1,1,2,2,3,3,3,4,4,4),
                   "recieved_id" = c(100,101,100,101,105,106,107,105,106,107),
                   "delay" = c('00:00:00','15:00:00','-00:14:59','00:00:01','23:00:05','00:01:00',
                               '-18:00:00','15:00:00','23:00:00','00:30:10'))

In the end I would like to get something like this:

data2 <- data.frame("sent_id" = c(1,1,2,2,3,3,3,4,4,4),
                    "recieved_id" = c(100,101,100,101,105,106,107,105,106,107),
                    "delay" = c('00:00:00','15:00:00','-00:14:59','00:00:01','23:00:05','00:01:00',
                                '-18:00:00','15:00:00','23:00:00','00:30:10'),
                    'best_match' = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE))

Answer 1

I removed the negative signs from the delay values and did the following.


test_data <- data.frame("sent_id" = c(1,1,2,2,3,3,3,4,4,4),
                        "recieved_id" = c(100,101,100,101,105,106,107,105,106,107),
                        "delay" = c('00:00:00','15:00:00','00:14:59','00:00:01','23:00:05','00:01:00',
                                    '18:00:00','15:00:00','23:00:00','00:30:10'))

received_id <- unique(test_data$recieved_id)

sent_id_2 <- unique(test_data$sent_id)

library(dplyr)

This still needs to be cleaned up into real code, but it should get you there.
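One way the cleanup might look, as a rough sketch rather than the answerer's actual code: it assumes "best match" means the smallest absolute delay within each sent_id, and that every delay is a string of the form [-]HH:MM:SS. The helper delay_to_abs_seconds is a hypothetical name introduced here for illustration. It works on the question's original data, so the negative signs do not have to be stripped by hand.

```r
library(dplyr)

data <- data.frame(
  sent_id = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
  recieved_id = c(100, 101, 100, 101, 105, 106, 107, 105, 106, 107),
  delay = c("00:00:00", "15:00:00", "-00:14:59", "00:00:01", "23:00:05",
            "00:01:00", "-18:00:00", "15:00:00", "23:00:00", "00:30:10")
)

# Hypothetical helper (not from the original answer): convert "[-]HH:MM:SS"
# strings to an absolute number of seconds, dropping any leading minus sign.
delay_to_abs_seconds <- function(x) {
  parts <- strsplit(sub("^-", "", as.character(x)), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

result <- data %>%
  mutate(abs_delay = delay_to_abs_seconds(delay)) %>%
  group_by(sent_id) %>%
  # Assumption: the best match is the row with the smallest absolute delay
  # in its sent_id group; rows that tie for the minimum are all flagged TRUE.
  mutate(best_match = abs_delay == min(abs_delay)) %>%
  ungroup() %>%
  select(-abs_delay)

print(result)
```

Note that this only picks a winner per sent_id. It does not stop the same recieved_id from being chosen for two different sent_id values; if the match has to be one-to-one on both sides, an extra assignment step would be needed.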

Find the best match based on the minimum difference between rows

1 answer: