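The fastLink and RecordLinkage packages do a great job of matching records (rows) from database A to database B and vice versa. The developers are working on extending this from matching just 2 databases to matching multiple databases. I have given a simple example here.
In the meantime, how would we go about matching multiple data frames? For example, suppose I have multiple medical records for patients from clinics A, B, C, D, E and F, and I want to merge them into one.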
dfA <-
structure(list(fname = c("Jafar", "Nemo", "Simba", "Belle", "Nala",
"Jasmine"), lname = c("Evil", "Water", "King", "Beauty", "Princess",
"Princess"), gender = c("M", "M", "M", "F", "F", "F"), dob = c(1987,
2000, 2011, 1989, 1970, 1989), city = c("Arabtown", "Atlantic",
"Sahara", "Nice", "Sahara", "Arabtown")), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
dfB <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Nala", "Jasmine"), lname = c("Evil", "Waterson", "King", "Beauty",
"Princess", "Princess of Arabtown"), gender = c("M", "M", "M",
"F", "F", "F"), dob = c(NA, 2000, 2011, NA, NA, 1989), city = c("Arabtown",
"Atlantica", "Sahara", "Nice-France", "Sahara", "Arabia")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
dfC <-
structure(list(fname = c("Jafar Jr", "Fishy", "Lion", "Belle",
"Sarabi", "Jasmine"), lname = c("Evil", "Waterpal", "King", "Beauty",
"Queen", NA), gender = c("M", "M", NA, "F", "F", "F"), dob = c(NA,
2000, 2011, NA, 1940, 1989), city = c("Arabia", NA, "Sahara",
"France", "Sahara", NA)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfD <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Sarabi", "Jasmine"), lname = c("Evil", "Waterson", "King", "Beast",
"Queen", "Evil"), gender = c("M", "M", "M", "F", "F", "M"), dob = c(NA,
2000, 2011, 1989, NA, 1989), city = c("Arabtown", "Atlantica",
"Sahara", NA, "Sahara", "Arabtown")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfE <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Nala", "Aladdin"), lname = c("Evil", "Pateron", NA, "Gaston",
NA, "Streetrat"), gender = c("M", NA, "M", "F", "F", "M"), dob = c(1987,
NA, NA, NA, 1970, 1989), city = c("Arabtown", "Atlantica", "Sahara",
"France", "Sahara", "Arabia")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
dfF <-
structure(list(fname = c("Jafar Jr", "Nemo", "Simba", "Belle",
"Nala", "Al"), lname = c("Evil", "Waterson", "Dead", "Beauty",
"Princess", "Streetrat"), gender = c("M", "M", NA, "F", "F",
"M"), dob = c(1987, 2000, 2011, NA, NA, 1989), city = c("Arabia",
"Atlantic", "Sahara", "Nice-France", "Sahara", "Arabia")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
In the end, I would like to have uniquely identified records:
1 Jafar Evil M 1987 Arabtown
2 Nemo Water M 2000 Atlantic
3 Simba King M 2011 Sahara
4 Belle Beauty F 1989 Nice
5 Nala Princess F 1970 Sahara
6 Jasmine Princess F 1989 Arabtown
7 Sarabi Queen F 1940 Sahara
8 Aladdin Streetrat M 1989 Arabia
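It is fine even if the result is not as clean as the one above. The goal is to go through all 6 sets of records, find the records that belong to the same entity, and arrive at one unified record for each. Both fastLink and RecordLinkage take care of deduplication (removing duplicates).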
What I need help with is ideas/approaches/methods for handling more than two databases.
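One direction I can think of, since both packages already handle deduplication, is to stack all the data frames into a single table and treat the whole thing as one deduplication problem. Below is a minimal base-R sketch of just the stacking step. The `dfs` list, the `source` column and the trimmed three-row data frames are stand-ins I made up for illustration, and I have not checked how well either package's deduplication performs on the stacked result.

```r
# Toy stand-ins for the clinic data frames (hypothetical, trimmed to
# three rows and three columns each so the sketch is self-contained).
dfs <- list(
  A = data.frame(fname = c("Jafar", "Nemo", "Simba"),
                 lname = c("Evil", "Water", "King"),
                 dob   = c(1987, 2000, 2011),
                 stringsAsFactors = FALSE),
  B = data.frame(fname = c("Jafar Jr", "Nemo", "Simba"),
                 lname = c("Evil", "Waterson", "King"),
                 dob   = c(NA, 2000, 2011),
                 stringsAsFactors = FALSE),
  C = data.frame(fname = c("Jafar Jr", "Fishy", "Lion"),
                 lname = c("Evil", "Waterpal", "King"),
                 dob   = c(NA, 2000, 2011),
                 stringsAsFactors = FALSE)
)

# Stack all sources into one table, tagging each row with the
# (assumed) name of the clinic it came from.
stacked <- do.call(rbind, lapply(names(dfs), function(src) {
  data.frame(source = src, dfs[[src]], stringsAsFactors = FALSE)
}))
rownames(stacked) <- NULL

# From here on, `stacked` is a single table, so the task becomes
# ordinary deduplication instead of pairwise database matching.
print(stacked)
```

With the real data this would be `dfs <- list(A = dfA, B = dfB, C = dfC, D = dfD, E = dfE, F = dfF)`. Keeping the `source` column makes it possible to trace each unified record back to the clinics that contributed to it.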