Matching records in R based on column difference

Time: 2018-10-16 14:52:19

Tags: r

I have this data frame:

df1<-data.frame(ID_NUMBER = c(7160015,22695229,22695230,7160016,7160017,22695198,7160018,22695199,7160019,22695200,7160020,22695232,7160030,22697153,22697158,7162962,22698039,22698041,7162964) 
, CalNumber = c(9662.37,9662.45,9663.41,9663.44,9665.97,9666.11,9667.04,9667.1,9667.87,9668.01,9668.74,9668.79,9868.2, 72719.75,72723.21,99774,99774.03,99776.11,99776.13)
,Inspection_Date = c('11/13/2009','10/8/2014','10/8/2014','11/13/2009','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','10/8/2014','11/13/2009','10/8/2014','10/8/2014','11/13/2009'))

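I am trying to match the 10/8/2014 records to the 11/13/2009 records based on the closest value of CalNumber, where the absolute difference is <= 1. The records are sorted by CalNumber. The closest 11/13/2009 record can come either before or after the 10/8/2014 record. Once a 10/8/2014 record has been matched to its closest 11/13/2009 record, that 11/13/2009 record is no longer considered for other matches.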

Sorry if this is confusing. Hopefully the following explains it better. This is what the final result set should look like:

df1<-data.frame(ID_NUMBER = c(7160015,22695229,22695230,7160016,7160017,22695198,7160018,22695199,7160019,22695200,7160020,22695232,7160030,22697153,22697158,7162962,22698039,22698041,7162964) 
, CalNumber = c(9662.37,9662.45,9663.41,9663.44,9665.97,9666.11,9667.04,9667.1,9667.87,9668.01,9668.74,9668.79,9868.2, 72719.75,72723.21,99774,99774.03,99776.11,99776.13)
,Inspection_Date = c('11/13/2009','10/8/2014','10/8/2014','11/13/2009','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','10/8/2014','11/13/2009','10/8/2014','10/8/2014','11/13/2009')
,Diff = c(NA,0.08,0.03,NA,NA,0.14,NA,0.06,NA,0.14,NA,0.05,NA, NA,NA,NA,0.03,0.02,NA)
,MatchID = c(NA,7160015,7160016,NA,NA,7160017,NA,7160018,NA,7160019,NA,7160020,NA, NA,NA,NA,7162962,7162964,NA))

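The final result set has two additional columns. Diff is the absolute difference in CalNumber to the closest record, and it must be <= 1. MatchID is the ID_NUMBER of that closest record. If a 10/8/2014 record has no match within <= 1, both columns are left blank for it. MatchID is blank for all 11/13/2009 records; it is filled in only for 10/8/2014 records, with the ID of the closest 11/13/2009 match.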

Thanks in advance!

2 answers:

Answer 0: (score: 2)

I am still fairly new to data.table, so bear with me:

library(data.table)

dt1 <- data.table(df1)
dt2 <- copy(dt1)

setnames(dt2, c("ID_NUMBER", "CalNumber", "Inspection_Date"), c("ID_NUMBER2", "CalNumber2", "Inspection_Date2"))

dt2[dt1,
    .(ID_NUMBER,
      CalNumber,
      Inspection_Date,
      Diff = abs(CalNumber - CalNumber2),
      MatchID = ID_NUMBER2),
    on = .(Inspection_Date2 > Inspection_Date),
    allow.cartesian = TRUE
    ][,
      .SD[which.min(ifelse(is.na(Diff), Inf, Diff))],
      by = .(ID_NUMBER, CalNumber, Inspection_Date)
      ][,
        .(ID_NUMBER,
          CalNumber,
          Inspection_Date,
          Diff = ifelse(Diff > 1, NA, Diff),
          MatchID = ifelse(Diff > 1, NA, MatchID))
        ]

    ID_NUMBER CalNumber Inspection_Date Diff MatchID
 1:   7160015   9662.37      11/13/2009   NA      NA
 2:  22695229   9662.45       10/8/2014 0.08 7160015
 3:  22695230   9663.41       10/8/2014 0.03 7160016
 4:   7160016   9663.44      11/13/2009   NA      NA
 5:   7160017   9665.97      11/13/2009   NA      NA
 6:  22695198   9666.11       10/8/2014 0.14 7160017
 7:   7160018   9667.04      11/13/2009   NA      NA
 8:  22695199   9667.10       10/8/2014 0.06 7160018
 9:   7160019   9667.87      11/13/2009   NA      NA
10:  22695200   9668.01       10/8/2014 0.14 7160019
11:   7160020   9668.74      11/13/2009   NA      NA
12:  22695232   9668.79       10/8/2014 0.05 7160020
13:   7160030   9868.20      11/13/2009   NA      NA
14:  22697153  72719.75       10/8/2014   NA      NA
15:  22697158  72723.21       10/8/2014   NA      NA
16:   7162962  99774.00      11/13/2009   NA      NA
17:  22698039  99774.03       10/8/2014 0.03 7162962
18:  22698041  99776.11       10/8/2014 0.02 7162964
19:   7162964  99776.13      11/13/2009   NA      NA

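dt1 is copied because I had trouble referencing columns during the self-join. I also suspect that some of these operations could be combined, so input from other users is very welcome.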

Logic:

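  • The first set of brackets performs a left non-equi join of dt1 and dt2 and computes the Diff variable. data.table's left join syntax is a little odd: dt2[dt1, ...] keeps every row of dt1 and pulls in the rows of dt2 that match as specified in the on argument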

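  • The second set of brackets keeps, within each group, the record with the minimum value. The value used here is a slightly modified Diff variable (see the post where I asked for help with this)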

  • The third set of brackets sets Diff and MatchID to NA wherever the minimum Diff is greater than 1
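
The reason for the modified Diff in the second step can be seen on a small example. This is only a sketch on hypothetical toy data (the table toy and its column grp are not part of the answer): which.min() returns a zero-length result when every value in a group is NA, so that group would disappear from the output, whereas mapping NA to Inf keeps one row per group.

```r
library(data.table)

# Hypothetical toy data: group "a" has two candidates, group "b" has none (NA).
toy <- data.table(
  grp  = c("a", "a", "b"),
  Diff = c(0.30, 0.10, NA)
)

# Plain which.min() yields integer(0) for the all-NA group, so "b" is dropped.
dropped <- toy[, .SD[which.min(Diff)], by = grp]

# Replacing NA with Inf gives which.min() something to pick, so "b" is kept.
kept <- toy[, .SD[which.min(ifelse(is.na(Diff), Inf, Diff))], by = grp]
```

Here dropped has a single row for group "a", while kept also retains group "b" with Diff still NA, which is what allows unmatched records to stay in the final result.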

Answer 1: (score: 2)

Thanks to @zack's answer, I think I now understand what the OP is after. To find the nearest match, you can usually use a rolling join:

setDT(df1)
df1[Inspection_Date == "10/8/2014", c("md", "mid") := 
  df1[Inspection_Date == "11/13/2009"][.SD, on=.(CalNumber), roll="nearest", 
    .(abs(x.CalNumber - i.CalNumber), x.ID_NUMBER)
  ]
]

# oh, and then wipe it out if diff > 1
df1[md > 1, c("md", "mid") := NA]


    ID_NUMBER CalNumber Inspection_Date Diff MatchID   md     mid
 1:   7160015   9662.37      11/13/2009   NA      NA   NA      NA
 2:  22695229   9662.45       10/8/2014 0.08 7160015 0.08 7160015
 3:  22695230   9663.41       10/8/2014 0.03 7160016 0.03 7160016
 4:   7160016   9663.44      11/13/2009   NA      NA   NA      NA
 5:   7160017   9665.97      11/13/2009   NA      NA   NA      NA
 6:  22695198   9666.11       10/8/2014 0.14 7160017 0.14 7160017
 7:   7160018   9667.04      11/13/2009   NA      NA   NA      NA
 8:  22695199   9667.10       10/8/2014 0.06 7160018 0.06 7160018
 9:   7160019   9667.87      11/13/2009   NA      NA   NA      NA
10:  22695200   9668.01       10/8/2014 0.14 7160019 0.14 7160019
11:   7160020   9668.74      11/13/2009   NA      NA   NA      NA
12:  22695232   9668.79       10/8/2014 0.05 7160020 0.05 7160020
13:   7160030   9868.20      11/13/2009   NA      NA   NA      NA
14:  22697153  72719.75       10/8/2014   NA      NA   NA      NA
15:  22697158  72723.21       10/8/2014   NA      NA   NA      NA
16:   7162962  99774.00      11/13/2009   NA      NA   NA      NA
17:  22698039  99774.03       10/8/2014 0.03 7162962 0.03 7162962
18:  22698041  99776.11       10/8/2014 0.02 7162964 0.02 7162964
19:   7162964  99776.13      11/13/2009   NA      NA   NA      NA
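
The question also asks that each 11/13/2009 record be used at most once, which a rolling join does not enforce by itself. On the data above every matched ID happens to be distinct, so it makes no difference here. For other data, a quick check along the following lines may help; it is only a sketch on hypothetical toy tables (recs09, recs14 and the column id14 are made-up names).

```r
library(data.table)

# Hypothetical toy data: both 2014 records are nearest to the same 2009 record.
recs09 <- data.table(ID_NUMBER = c(1, 2), CalNumber = c(10.0, 50.0))
recs14 <- data.table(ID_NUMBER = c(101, 102), CalNumber = c(10.1, 10.2))

# Same rolling-join pattern as above: one result row per 2014 record.
matched <- recs09[recs14, on = .(CalNumber), roll = "nearest",
                  .(id14 = i.ID_NUMBER, mid = x.ID_NUMBER)]

# Count how many 2014 records share each matched 2009 ID; N > 1 means reuse.
reuse <- matched[!is.na(mid), .N, by = mid][N > 1]
```

A non-empty reuse table means some 2009 record was matched more than once, and those rows would need a tie-breaking rule that this sketch does not implement.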

I am hard-coding the specific dates, following the OP's own description...

  

I am trying to match the 10/8/2014 records to the 11/13/2009 records based on the closest value of CalNumber, where the absolute difference is <= 1.

...whereas zack's answer compares the dates in general. (Note that for that approach you should use a proper date format, e.g. df1[, Inspection_Date := as.IDate(Inspection_Date, "%m/%d/%Y")].)
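
To illustrate why the date format matters, here is a minimal sketch on a hypothetical two-row table (dates, as_text_later and as_date_later are made-up names). Compared as character strings, "10/8/2014" sorts before "11/13/2009"; after conversion with as.IDate the comparison follows the calendar.

```r
library(data.table)

# Hypothetical two-row table holding the two inspection dates as text.
dates <- data.table(Inspection_Date = c("11/13/2009", "10/8/2014"))

# As character strings the comparison is lexical: "10/..." sorts before "11/...".
dates[, as_text_later := Inspection_Date > "11/13/2009"]

# Converted to IDate, the comparison follows the calendar.
dates[, Inspection_Date := as.IDate(Inspection_Date, format = "%m/%d/%Y")]
dates[, as_date_later := Inspection_Date > as.IDate("2009-11-13")]
```

After this, as_text_later is FALSE for both rows, while as_date_later is TRUE for the 2014 row only.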


How it works

The key part is x[i, on=, roll=, j], which joins the 2009 subset x = df1[Inspection_Date == "11/13/2009"] and the 2014 subset i = .SD = df1[Inspection_Date == "10/8/2014"] based on the conditions in on= and roll=.

Inside the j of x[i, on=, roll=, j], the prefixes x.* and i.* can be used to disambiguate column names that the two tables have in common.
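
The prefixes can be tried out on a small example. This is only a sketch with hypothetical toy tables (recs09 standing in for x, recs14 for i; the output names id14, mid and md are chosen for illustration):

```r
library(data.table)

# Hypothetical toy tables: recs09 plays the role of x, recs14 the role of i.
recs09 <- data.table(ID_NUMBER = c(1, 2), CalNumber = c(10.0, 20.0))
recs14 <- data.table(ID_NUMBER = c(101, 102), CalNumber = c(10.4, 18.5))

# Both tables have ID_NUMBER and CalNumber; x.* reads from recs09, i.* from recs14.
res <- recs09[recs14, on = .(CalNumber), roll = "nearest",
              .(id14 = i.ID_NUMBER,
                mid  = x.ID_NUMBER,
                md   = abs(x.CalNumber - i.CalNumber))]
```

The result has one row per row of recs14: record 101 is paired with ID 1 and record 102 with ID 2, with md holding the absolute CalNumber difference. The second pair has md above 1, so it would be blanked out by the md > 1 step shown earlier.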