I have this data frame:
df1<-data.frame(ID_NUMBER = c(7160015,22695229,22695230,7160016,7160017,22695198,7160018,22695199,7160019,22695200,7160020,22695232,7160030,22697153,22697158,7162962,22698039,22698041,7162964)
, CalNumber = c(9662.37,9662.45,9663.41,9663.44,9665.97,9666.11,9667.04,9667.1,9667.87,9668.01,9668.74,9668.79,9868.2, 72719.75,72723.21,99774,99774.03,99776.11,99776.13)
,Inspection_Date = c('11/13/2009','10/8/2014','10/8/2014','11/13/2009','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','10/8/2014','11/13/2009','10/8/2014','10/8/2014','11/13/2009'))
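I am trying to match each 10/8/2014 record to an 11/13/2009 record based on the closest CalNumber value (with an absolute difference <= 1). The records are sorted by CalNumber. The closest 11/13/2009 record can come either before or after the 10/8/2014 record. Once a 10/8/2014 record has been matched to its closest 11/13/2009 record, that 11/13/2009 record is no longer considered.
Sorry if this is confusing. Hopefully this explains it better: this is what the final result set should look like.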
df1<-data.frame(ID_NUMBER = c(7160015,22695229,22695230,7160016,7160017,22695198,7160018,22695199,7160019,22695200,7160020,22695232,7160030,22697153,22697158,7162962,22698039,22698041,7162964)
, CalNumber = c(9662.37,9662.45,9663.41,9663.44,9665.97,9666.11,9667.04,9667.1,9667.87,9668.01,9668.74,9668.79,9868.2, 72719.75,72723.21,99774,99774.03,99776.11,99776.13)
,Inspection_Date = c('11/13/2009','10/8/2014','10/8/2014','11/13/2009','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','11/13/2009','10/8/2014','10/8/2014','11/13/2009','10/8/2014','10/8/2014','11/13/2009')
,Diff = c(NA,0.08,0.03,NA,NA,0.14,NA,0.06,NA,0.14,NA,0.05,NA, NA,NA,NA,0.03,0.02,NA)
,MatchID = c(NA,7160015,7160016,NA,NA,7160017,NA,7160018,NA,7160019,NA,7160020,NA, NA,NA,NA,7162962,7162964,NA))
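The final result set has two additional columns. Diff is the absolute difference in CalNumber between a record and its closest match, kept only when it is <= 1. MatchID is the ID_NUMBER of that closest record. If a 10/8/2014 record has no match within <= 1, both columns are left blank. MatchID is blank for all 11/13/2009 records; it is populated only for 10/8/2014 records, with the ID of the closest 11/13/2009 match.
Thanks in advance!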
Answer 0 (score: 2)
I'm still fairly new to data.table, so bear with me:
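library(data.table)
dt1 <- data.table(df1)
# Work on a renamed copy so both sides of the self-join can be referenced unambiguously
dt2 <- copy(dt1)
setnames(dt2, c("ID_NUMBER", "CalNumber", "Inspection_Date"), c("ID_NUMBER2", "CalNumber2", "Inspection_Date2"))
# 1) Left non-equi join: for each dt1 row, pull the dt2 rows with a greater Inspection_Date2
dt2[dt1,
    .(ID_NUMBER,
      CalNumber,
      Inspection_Date,
      Diff = abs(CalNumber - CalNumber2),
      MatchID = ID_NUMBER2),
    on = .(Inspection_Date2 > Inspection_Date),
    allow.cartesian = TRUE
][,
  # 2) Keep the candidate with the smallest Diff per original record (NA treated as Inf)
  .SD[which.min(ifelse(is.na(Diff), Inf, Diff))],
  by = .(ID_NUMBER, CalNumber, Inspection_Date)
][,
  # 3) Blank out Diff and MatchID when the closest candidate is more than 1 away
  .(ID_NUMBER,
    CalNumber,
    Inspection_Date,
    Diff = ifelse(Diff > 1, NA, Diff),
    MatchID = ifelse(Diff > 1, NA, MatchID))
]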
ID_NUMBER CalNumber Inspection_Date Diff MatchID
1: 7160015 9662.37 11/13/2009 NA NA
2: 22695229 9662.45 10/8/2014 0.08 7160015
3: 22695230 9663.41 10/8/2014 0.03 7160016
4: 7160016 9663.44 11/13/2009 NA NA
5: 7160017 9665.97 11/13/2009 NA NA
6: 22695198 9666.11 10/8/2014 0.14 7160017
7: 7160018 9667.04 11/13/2009 NA NA
8: 22695199 9667.10 10/8/2014 0.06 7160018
9: 7160019 9667.87 11/13/2009 NA NA
10: 22695200 9668.01 10/8/2014 0.14 7160019
11: 7160020 9668.74 11/13/2009 NA NA
12: 22695232 9668.79 10/8/2014 0.05 7160020
13: 7160030 9868.20 11/13/2009 NA NA
14: 22697153 72719.75 10/8/2014 NA NA
15: 22697158 72723.21 10/8/2014 NA NA
16: 7162962 99774.00 11/13/2009 NA NA
17: 22698039 99774.03 10/8/2014 0.03 7162962
18: 22698041 99776.11 10/8/2014 0.02 7162964
19: 7162964 99776.13 11/13/2009 NA NA
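The copy of dt1 is there because I had trouble referencing columns during the self-join. I also suspect some of these operations could be combined, so input from other users is very welcome.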
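The first set of square brackets performs a left non-equi join of dt1 with dt2 and computes the Diff variable. data.table's left-join syntax is a bit odd, but what it does is keep every row of dt1 and pull in the rows of dt2 that match it under the condition specified in the on argument.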
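To make the x[i, on = ...] left-join behavior concrete, here is a minimal sketch on made-up tables (the names a and b, their columns, and their values are hypothetical and not part of the question's data). Each row of the inner table a is kept, and every row of b satisfying the inequality is attached to it; an a row with no qualifying b row still appears, with NA on the b side.

```r
library(data.table)

# Hypothetical toy tables: a plays the role of dt1, b the role of dt2
a <- data.table(id = 1:3, t = c(1L, 5L, 10L))
b <- data.table(id2 = 11:13, t2 = c(0L, 3L, 9L))

# For each row of a, attach all rows of b whose t2 is greater than a's t
res <- b[a, .(id, id2), on = .(t2 > t), allow.cartesian = TRUE]
print(res)
# id = 1 pairs with id2 12 and 13, id = 2 pairs with id2 13,
# and id = 3 has no qualifying row in b, so its id2 is NA
```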
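The second set of brackets keeps, within each group, the record that matches the group minimum. The value being minimized is a slightly modified Diff variable (see this post where I asked for help).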
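The linked post is not reproduced here, so the following base-R sketch is only a plausible reading of why Diff is modified: which.min() discards missing values, so for a group whose Diff values are all NA it returns a zero-length index, and .SD[...] would then return no row for that group. Replacing NA with Inf keeps one row per group. The vectors below are made up for illustration.

```r
# A group where every candidate Diff is missing (no match on the other side)
all_na <- c(NA_real_, NA_real_)
# A group with one missing and two real differences
mixed <- c(NA, 0.73, 0.14)

idx_plain <- which.min(all_na)                             # zero-length: nothing to pick
idx_inf   <- which.min(ifelse(is.na(all_na), Inf, all_na)) # 1: the first row is kept
idx_mixed <- which.min(ifelse(is.na(mixed), Inf, mixed))   # 3: the smallest real Diff
```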
The third set of brackets assigns NA to Diff and MatchID in cases where the minimum Diff is greater than 1.
Answer 1 (score: 2)
Thanks to @zack's answer, I think I now understand what the OP is after. To find the closest match, you can generally use a rolling join:
setDT(df1)
# For the 2014 rows, look up the nearest 2009 row by CalNumber
df1[Inspection_Date == "10/8/2014", c("md", "mid") :=
  df1[Inspection_Date == "11/13/2009"][.SD, on = .(CalNumber), roll = "nearest",
    .(abs(x.CalNumber - i.CalNumber), x.ID_NUMBER)
  ]
]
# oh, and then wipe it out if diff > 1
df1[md > 1, c("md", "mid") := NA]
ID_NUMBER CalNumber Inspection_Date Diff MatchID md mid
1: 7160015 9662.37 11/13/2009 NA NA NA NA
2: 22695229 9662.45 10/8/2014 0.08 7160015 0.08 7160015
3: 22695230 9663.41 10/8/2014 0.03 7160016 0.03 7160016
4: 7160016 9663.44 11/13/2009 NA NA NA NA
5: 7160017 9665.97 11/13/2009 NA NA NA NA
6: 22695198 9666.11 10/8/2014 0.14 7160017 0.14 7160017
7: 7160018 9667.04 11/13/2009 NA NA NA NA
8: 22695199 9667.10 10/8/2014 0.06 7160018 0.06 7160018
9: 7160019 9667.87 11/13/2009 NA NA NA NA
10: 22695200 9668.01 10/8/2014 0.14 7160019 0.14 7160019
11: 7160020 9668.74 11/13/2009 NA NA NA NA
12: 22695232 9668.79 10/8/2014 0.05 7160020 0.05 7160020
13: 7160030 9868.20 11/13/2009 NA NA NA NA
14: 22697153 72719.75 10/8/2014 NA NA NA NA
15: 22697158 72723.21 10/8/2014 NA NA NA NA
16: 7162962 99774.00 11/13/2009 NA NA NA NA
17: 22698039 99774.03 10/8/2014 0.03 7162962 0.03 7162962
18: 22698041 99776.11 10/8/2014 0.02 7162964 0.02 7162964
19: 7162964 99776.13 11/13/2009 NA NA NA NA
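I'm hard-coding the specific dates, going by how the OP phrased it...
"I am trying to match the 10/8/2014 records to the 11/13/2009 records based on the closest CalNumber value (with an absolute difference <= 1)."
...whereas zack's answer compares dates in general. (Note that for that you should use a proper date format, e.g. df1[, Inspection_Date := as.IDate(Inspection_Date, "%m/%d/%Y")].)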
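As a small illustration of the date-format note, here is a base-R sketch that uses as.Date as a stand-in for data.table's as.IDate with the same format string (the variable names are made up). Once parsed, the two inspection dates compare chronologically; the raw strings are compared as text, so their ordering need not reflect which date is later.

```r
d2009 <- "11/13/2009"
d2014 <- "10/8/2014"
fmt <- "%m/%d/%Y"  # month/day/four-digit year, as in the question's data

parsed_2009 <- as.Date(d2009, format = fmt)
parsed_2014 <- as.Date(d2014, format = fmt)

# Chronological comparison on real dates
later_as_date <- parsed_2014 > parsed_2009

# Text comparison on the raw strings; compare this with the result above
later_as_text <- d2014 > d2009

print(c(as_date = later_as_date, as_text = later_as_text))
```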
How it works
The key part is the join x[i, on=, roll=, j] of the 2009 subset x = df1[Inspection_Date == "11/13/2009"] and the 2014 subset i = .SD = df1[Inspection_Date == "10/8/2014"], based on the conditions in on= and roll=.
Inside the j of x[i, on=, roll=, j], you can use the prefixes x.* and i.* to disambiguate column names that the two tables have in common.
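A toy version of this join might look like the sketch below. The tables x2009 and i2014, their values, and the output column name id are made up for illustration; only the join pattern follows the answer.

```r
library(data.table)

# Hypothetical stand-ins for the 2009 subset (x) and the 2014 subset (i)
x2009 <- data.table(ID_NUMBER = c(1L, 2L), CalNumber = c(10, 20))
i2014 <- data.table(ID_NUMBER = c(101L, 102L, 103L), CalNumber = c(10.4, 13, 19.5))

# For each row of i2014, roll to the x2009 row with the nearest CalNumber.
# x.* and i.* select the x-side or i-side copy of the shared column names.
res <- x2009[i2014, on = .(CalNumber), roll = "nearest",
             .(id  = i.ID_NUMBER,
               md  = abs(x.CalNumber - i.CalNumber),
               mid = x.ID_NUMBER)]

# Discard matches that are more than 1 apart
res[md > 1, c("md", "mid") := NA]
print(res)
# 10.4 is matched to ID 1 and 19.5 to ID 2; 13 is 3 away from its
# nearest neighbour (10), so its md and mid are set to NA
```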