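I'm stuck on how to copy data to another row within an ID variable based on whether the rows match. I'm working with thousands of historical addresses, and not all of them match exactly. Any differences are usually at the end of the address, though, so comparing the first 4 or 5 characters of the value should handle it. I want to fill in the NAs with the appropriate tract code. I've been trying dplyr solutions and not getting anywhere. Any ideas would be much appreciated.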
ID<-c(50,50,50,71,71,71)
ID_Y<-c(505,506,507,715,716,717)
address<-c("325 Park St N","325 Park St","325 Park","616 Holly","616 Holly Dr","510 Walnut Dr")
tract<-c(110,NA,NA,223,NA,989)
AD567<-data.frame(ID,ID_Y,address,tract)
AD567
  ID ID_Y       address tract
1 50  505 325 Park St N   110
2 50  506   325 Park St    NA
3 50  507      325 Park    NA
4 71  715     616 Holly   223
5 71  716  616 Holly Dr    NA
6 71  717 510 Walnut Dr   989
Trying to get to this:
  ID ID_Y       address tract
1 50  505 325 Park St N   110
2 50  506   325 Park St   110
3 50  507      325 Park   110
4 71  715     616 Holly   223
5 71  716  616 Holly Dr   223
6 71  717 510 Walnut Dr   989
Answer 0 (score: 1)
Here is a solution that does not need any additional libraries:
# introduce an additional column that serves as a heuristic key
AD567$prefix = substr(AD567$address, 1, 8)
# extract all records that have a tract code
TRACT = AD567[!is.na(AD567$tract), c("prefix", "tract")]
# count the tract records per prefix (each count should be 1)
aggregate(tract ~ prefix, TRACT, length)
# ... one may keep only those prefixes that are unique from here on ...
# merge both data frames to inject the tract code; all.x = TRUE makes sure
# nothing is lost from AD567
AD567 = merge(AD567, TRACT, by = "prefix", suffixes = c("", ".ref"), all.x = TRUE)
# copy over the tract code
AD567$tract = AD567$tract.ref
# remove the utility columns
AD567 = AD567[, !colnames(AD567) %in% c("prefix", "tract.ref")]
# merge() sorts rows by the key; re-sort (here by ID_Y) if the order matters
AD567 = AD567[order(AD567$ID_Y), ]
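The question asks for the fill to happen within each ID, which the prefix-only merge above does not enforce. A minimal base R sketch of that variant follows. `fill_if_unique` is a hypothetical helper name introduced for illustration, and the 8-character prefix is the same assumption as above. It fills NAs only when an ID/prefix group has exactly one distinct known tract, and it leaves ambiguous groups untouched.

```r
# Rebuild the example data from the question.
AD567 <- data.frame(
  ID      = c(50, 50, 50, 71, 71, 71),
  ID_Y    = c(505, 506, 507, 715, 716, 717),
  address = c("325 Park St N", "325 Park St", "325 Park",
              "616 Holly", "616 Holly Dr", "510 Walnut Dr"),
  tract   = c(110, NA, NA, 223, NA, 989),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill NAs in x only when the group has exactly one
# distinct known value; otherwise return the group unchanged.
fill_if_unique <- function(x) {
  known <- unique(x[!is.na(x)])
  if (length(known) == 1) x[is.na(x)] <- known
  x
}

# Heuristic key: first 8 characters of the address (an assumption; tune it).
prefix <- substr(AD567$address, 1, 8)

# ave() applies the helper within each ID/prefix group and keeps row order.
AD567$tract <- ave(AD567$tract, AD567$ID, prefix, FUN = fill_if_unique)
AD567
```

Because `ave()` returns values in the original row order, no merge or re-sorting step is needed, and rows whose group has conflicting tract codes stay NA for manual review.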
Keep in mind that this is a very crude heuristic. Inexact or fuzzy data matching is a science in its own right.
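As a small illustration of what inexact matching can look like beyond a fixed prefix, the sketch below uses base R's `adist()` (from the `utils` package) to compute pairwise edit distances between the example addresses. This swaps in a different technique from the prefix key, and any cutoff you pick for "close enough" would be an assumption to validate against real data.

```r
# Example addresses from the question.
addresses <- c("325 Park St N", "325 Park St", "325 Park",
               "616 Holly", "616 Holly Dr", "510 Walnut Dr")

# Pairwise generalized Levenshtein (edit) distances, ignoring case.
d <- adist(addresses, ignore.case = TRUE)

# Label rows and columns so the matrix is easier to read.
dimnames(d) <- list(addresses, addresses)
d
```

Small distances between rows of the same ID suggest the same address, but short strings can also sit close to unrelated addresses, so the distances are best treated as candidates to check rather than as confirmed matches.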