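I have data on multiple individuals over time that I am trying to link together in R. The problem is that the individuals' names are often very similar but spelled slightly differently, and the ID variables are frequently missing (party and district are never missing, but they are not enough to uniquely identify an individual). Below is an example covering 3 different individuals, all with the surname Ennis: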
df = structure(list(chamber = c("H", "H", "H", "H", "H", "H", "H",
"H", "H", "S", "S", "S", "S", "S"), year = c("2005", "2007",
"1997", "1999", "2001", "1995", "1997", "1999", "2001", "2007",
"2011", "2012", "2013", "2013"), name = c("Ennis", "Ennis", "Ennis, B",
"Ennis, B", "Ennis, B", "Ennis, D", "Ennis, D", "Ennis, D", "Ennis, D",
"Ennis", "Ennis, Bruce", "Ennis, Bruce", "Ennis, Bruce", "Ennis, J"
), party = c("100", "100", "100", "100", "100", "200", "200",
"200", "200", "100", "100", "100", "100", "100"), district = c("028",
"028", "028", "028", "028", "006", "006", "006", "006", "014",
"014", "014", "014", "007"), os.id = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, "DEL000009", "DEL000009", "DEL000009", NA), msp.id = c("1298",
"1298", NA, NA, "1298", NA, NA, NA, "13676", NA, "1298", "1298",
"1298", "567")), .Names = c("chamber", "year", "name", "party",
"district", "os.id", "msp.id"), row.names = c(NA, 14L), class = "data.frame")
which gives this example data frame:
   chamber year         name party district     os.id msp.id
 1       H 2005        Ennis   100      028      <NA>   1298
 2       H 2007        Ennis   100      028      <NA>   1298
 3       H 1997     Ennis, B   100      028      <NA>   <NA>
 4       H 1999     Ennis, B   100      028      <NA>   <NA>
 5       H 2001     Ennis, B   100      028      <NA>   1298
 6       H 1995     Ennis, D   200      006      <NA>   <NA>
 7       H 1997     Ennis, D   200      006      <NA>   <NA>
 8       H 1999     Ennis, D   200      006      <NA>   <NA>
 9       H 2001     Ennis, D   200      006      <NA>  13676
10       S 2007        Ennis   100      014      <NA>   <NA>
11       S 2011 Ennis, Bruce   100      014 DEL000009   1298
12       S 2012 Ennis, Bruce   100      014 DEL000009   1298
13       S 2013 Ennis, Bruce   100      014 DEL000009   1298
14       S 2013     Ennis, J   100      007      <NA>    567
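So observations 1-5 and 10-13 describe "Ennis, B", observations 6-9 describe "Ennis, D", and observation 14 describes "Ennis, J". I worked this out by logically piecing together several fields in my head. Of course I would like to automate this, since I have several hundred thousand observations like these. Ultimately, I want to assign a unique, non-missing ID to all 14 observations. In this case, that would be 3 unique IDs.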
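I did some research, and I think the RecordLinkage package in R can do what I need, including some fuzzy string matching on the names. The problem is that the blocking variables I want to use (party, os.id and msp.id) are not all reliably present: os.id and msp.id are often NA, and blocking variables need to match exactly. This is a problem because, for example, observation 3 and observation 11 describe the same person, but observation 3 has NA for os.id and msp.id.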
Here is the code I have been tinkering with, which falls short because of the missing blocking variables:
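library(RecordLinkage)
# Block on columns 4 (party), 6 (os.id) and 7 (msp.id) together; strcmp = TRUE turns on fuzzy string comparison
rpairsfuzzy <- compare.dedup(df, blockfld = c(4, 6, 7), strcmp = TRUE)
dim(rpairsfuzzy$pairs)
rpairsfuzzy$pairs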
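It only identifies observations 11-13 as matches, because they are the only rows with non-missing values in all of the blocking fields. But that is clearly wrong.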
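One possible direction, shown here only as a minimal base R sketch and not as a RecordLinkage solution: treat each shared key as a link between rows and then take the connected groups, so that an ID present on one row can carry over to rows where it is missing. The linking rule is an assumption made for illustration: two rows are linked if they share a non-missing msp.id, share a non-missing os.id, or agree on chamber, party, district and surname. The function name `assign_person_id` is hypothetical, and no fuzzy name matching is done.

```r
# Example data rebuilt compactly from the data frame above (same 14 rows).
df <- data.frame(
  chamber  = rep(c("H", "S"), times = c(9, 5)),
  year     = c("2005", "2007", "1997", "1999", "2001", "1995", "1997",
               "1999", "2001", "2007", "2011", "2012", "2013", "2013"),
  name     = c("Ennis", "Ennis", rep("Ennis, B", 3), rep("Ennis, D", 4),
               "Ennis", rep("Ennis, Bruce", 3), "Ennis, J"),
  party    = c(rep("100", 5), rep("200", 4), rep("100", 5)),
  district = c(rep("028", 5), rep("006", 4), rep("014", 4), "007"),
  os.id    = c(rep(NA, 10), rep("DEL000009", 3), NA),
  msp.id   = c("1298", "1298", NA, NA, "1298", NA, NA, NA, "13676",
               NA, "1298", "1298", "1298", "567"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: group rows that are connected through any shared key.
assign_person_id <- function(d) {
  n <- nrow(d)
  parent <- seq_len(n)  # union-find forest: each row starts as its own root

  # Follow parent pointers until reaching a root.
  find <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # Merge all rows in one key group under the smallest root among them.
  merge_rows <- function(rows) {
    roots <- vapply(rows, find, integer(1))
    parent[roots] <<- min(roots)
  }

  # Assumed linking keys; the last one uses only the surname (text before the comma).
  surname <- trimws(sub(",.*$", "", d$name))
  keys <- list(
    d$msp.id,
    d$os.id,
    paste(d$chamber, d$party, d$district, surname, sep = "|")
  )

  # split() drops rows whose key is NA, so missing IDs never create links.
  for (key in keys) {
    for (rows in split(seq_len(n), key)) merge_rows(rows)
  }

  roots <- vapply(seq_len(n), find, integer(1))
  match(roots, unique(roots))  # relabel the groups as 1, 2, 3, ...
}

df$person.id <- assign_person_id(df)
df[, c("name", "district", "os.id", "msp.id", "person.id")]
```

On the example data this puts rows 1-5 and 10-13 in one group, rows 6-9 in a second and row 14 in a third. The surname-based key is deliberately loose: two different people with the same surname in the same chamber, party and district would be merged, so on the full data it would probably need an extra check on first initials or a fuzzy name comparison.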