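I am trying to change the unit of analysis of a dataset from reported incidents to the individuals who reported them. Because the same person may have reported more than once, I used the compare.dedup function from R's RecordLinkage package to identify matching pairs, i.e. pairs of incidents reported by the same person. However, I am struggling to export all of these pairs into a single dataset for further analysis.
Here is the code for some dummy data: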
# Dummy data: one row per reported incident.
# factor() is used instead of hand-written .Label vectors, which contained duplicate levels.
incidents <- data.frame(
  Date = as.Date(c("01-02-2014", "02-02-2014", "02-02-2014", "03-02-2014", "04-02-2014", "05-02-2014"), format = "%d-%m-%Y"),
  Name = factor(c("Dave", "Joe", "David", "Joseph", "Jo", "Dave")),
  Surname = factor(c("Evans", "Miles", "Evans", "Myles", "Doe", "Evans")),
  Sex = factor(c("Male", "Male", "Male", "Male", "Female", "Male"), levels = c("Male", "Female")),
  DOB = as.Date(c("14-02-1988", "01-05-1987", "14-02-1988", "01-05-1987", "04-02-1999", "14-02-1988"), format = "%d-%m-%Y")
)
When printed, incidents looks like this:
        Date   Name Surname    Sex        DOB
1 2014-02-01   Dave   Evans   Male 1988-02-14
2 2014-02-02    Joe   Miles   Male 1987-05-01
3 2014-02-02  David   Evans   Male 1988-02-14
4 2014-02-03 Joseph   Myles   Male 1987-05-01
5 2014-02-04     Jo     Doe Female 1999-02-04
6 2014-02-05   Dave   Evans   Male 1988-02-14
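I have managed to print each pair in a single row, but what I want is to gather each whole cluster of matching incidents into one row (see below).
I ran the following code to identify and extract the matching pairs: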
library(RecordLinkage)

# Generate the comparison pairs
pairs <- compare.dedup(incidents,
                       identity = NA,
                       blockfld = FALSE,
                       phonetic = c(2),             # phonetic comparison on Name
                       phonfun = pho_h,
                       strcmp = c(3, 4, 5),         # string comparison on Surname, Sex, DOB
                       strcmpfun = levenshteinSim,  # Levenshtein-based similarity
                       exclude = c(1))              # leave the report date out of the comparison
# Compute the EM weights
weightedpairs <- emWeights(pairs, cutoff = 0.7)
# Classify the pairs as links or non-links
emresult <- emClassify(weightedpairs)
I can get the linked pairs in single rows:
links <- getPairs(emresult, show = "links", single.rows = TRUE)
links
    id1     Date.1 Name.1 Surname.1 Sex.1      DOB.1 id2     Date.2 Name.2 Surname.2 Sex.2      DOB.2    Weight
1.1   1 2014-02-01   Dave     Evans  Male 1988-02-14   6 2014-02-05   Dave     Evans  Male 1988-02-14 20.876240
1     1 2014-02-01   Dave     Evans  Male 1988-02-14   3 2014-02-02  David     Evans  Male 1988-02-14 10.208543
3     3 2014-02-02  David     Evans  Male 1988-02-14   6 2014-02-05   Dave     Evans  Male 1988-02-14 10.208543
2     2 2014-02-02    Joe     Miles  Male 1987-05-01   4 2014-02-03 Joseph     Myles  Male 1987-05-01  9.886615
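However, what I want to achieve is to merge all matches so that I end up with one row per person, with that person's reports laid out in order of report date. Something like this: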
Date Name Surname Sex DOB Date2 Name2 Surname2 Sex2 DOB2 Date3 Name3 Surname3 Sex3 DOB3
1 2014-02-01 Dave Evans Male 1988-02-14 2014-02-02 David Evans Male 1988-02-14 2014-02-05 Dave Evans Male 1988-02-14
2 2014-02-02 Joe Miles Male 1987-05-01 2014-02-03 Joseph Myles Male 1987-05-01
3 2014-02-04 Jo Doe Female 1999-02-04 NA NA NA
Does anyone have a suggestion for how to achieve this?
Thanks!
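One possible direction, offered as a sketch under stated assumptions rather than a solution tested against the real data: treat each linked pair as an edge between two incident rows, label the connected groups, and then reshape to one row per group. The helper cluster_ids, the columns person and report, and the hard-coded link_ids (copied from the id1 and id2 columns printed above) are all made up for illustration. With the real objects, link_ids would be taken from links, and its id columns may need converting to integers first.

```r
# Incident data as printed in the question (plain character columns for simplicity).
incidents <- data.frame(
  Date    = as.Date(c("2014-02-01", "2014-02-02", "2014-02-02",
                      "2014-02-03", "2014-02-04", "2014-02-05")),
  Name    = c("Dave", "Joe", "David", "Joseph", "Jo", "Dave"),
  Surname = c("Evans", "Miles", "Evans", "Myles", "Doe", "Evans"),
  Sex     = c("Male", "Male", "Male", "Male", "Female", "Male"),
  DOB     = as.Date(c("1988-02-14", "1987-05-01", "1988-02-14",
                      "1987-05-01", "1999-02-04", "1988-02-14")),
  stringsAsFactors = FALSE
)

# Assumption: the linked row ids, copied by hand from the printed links table.
link_ids <- data.frame(id1 = c(1L, 1L, 3L, 2L), id2 = c(6L, 3L, 6L, 4L))

# Hypothetical helper: give every row a group label, then merge the labels
# of the two rows in each linked pair (a simple connected-components pass).
cluster_ids <- function(n, id1, id2) {
  label <- seq_len(n)
  for (k in seq_along(id1)) {
    a <- label[id1[k]]
    b <- label[id2[k]]
    if (a != b) label[label == max(a, b)] <- min(a, b)
  }
  label
}

incidents$person <- cluster_ids(nrow(incidents), link_ids$id1, link_ids$id2)

# Order reports by date within each person and number them 1, 2, 3, ...
incidents <- incidents[order(incidents$person, incidents$Date), ]
incidents$report <- ave(seq_len(nrow(incidents)), incidents$person, FUN = seq_along)

# One row per person; columns become Date1, Name1, ..., Date2, Name2, ...
wide <- reshape(incidents, idvar = "person", timevar = "report",
                direction = "wide", sep = "")
print(wide)
```

Unlinked incidents keep their own label, so a person with a single report still gets a row, with NA in the columns for later reports. For larger data, a graph library's connected-components routine would likely be faster than this loop.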