RecordLinkage: changing the unit of analysis: how do I export all linked matches from a deduplicated dataset into a single row of a new data frame?

Asked: 2019-05-20 19:35:00

Tags: r record-linkage

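I am trying to change the unit of analysis of a dataset from reported incidents to the individuals who reported them. Because the same person may have reported more than once, I used the compare.dedup function from R's RecordLinkage package to identify matching pairs, that is, pairs of incidents reported by the same person. However, I am struggling to export all of these pairs into a single dataset for further analysis.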

Here is the code for some dummy data:

# Dummy incident reports; stringsAsFactors = TRUE keeps the text columns as factors
incidents <- data.frame(
  Date = as.Date(c("01-02-2014", "02-02-2014", "02-02-2014",
                   "03-02-2014", "04-02-2014", "05-02-2014"),
                 format = "%d-%m-%Y"),
  Name = c("Dave", "Joe", "David", "Joseph", "Jo", "Dave"),
  Surname = c("Evans", "Miles", "Evans", "Myles", "Doe", "Evans"),
  Sex = c("Male", "Male", "Male", "Male", "Female", "Male"),
  DOB = as.Date(c("14-02-1988", "01-05-1987", "14-02-1988",
                  "01-05-1987", "04-02-1999", "14-02-1988"),
                format = "%d-%m-%Y"),
  stringsAsFactors = TRUE
)

When printed, incidents looks like this:

        Date   Name Surname    Sex        DOB
1 2014-02-01   Dave   Evans   Male 1988-02-14
2 2014-02-02    Joe   Miles   Male 1987-05-01
3 2014-02-02  David   Evans   Male 1988-02-14
4 2014-02-03 Joseph   Myles   Male 1987-05-01
5 2014-02-04     Jo     Doe Female 1999-02-04
6 2014-02-05   Dave   Evans   Male 1988-02-14

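I have managed to print the pairs in single rows, but what I want is to put each whole cluster of matches on a single row (see below).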

I ran the following code to identify and extract the matching pairs:

# Load the package that provides compare.dedup, emWeights, emClassify and getPairs
library(RecordLinkage)

# Generating the pairs
pairs = compare.dedup(incidents,
                      identity = NA, 
                      blockfld = FALSE,
                      phonetic = c(2), # runs a phonetic comparison
                      phonfun = pho_h,
                      strcmp = c(3,4,5), # runs a string comparison
                      strcmpfun = levenshteinSim, # Levenshtein-based similarity
                      exclude = c(1))

# Generating the weights
weightedpairs = emWeights(pairs, cutoff = 0.7)

#Classify the pairs
emresult = emClassify(weightedpairs)

I can get the linked pairs in single rows:

links=getPairs(emresult,show="links", single.rows=TRUE)

links

    id1     Date.1 Name.1 Surname.1 Sex.1      DOB.1 id2     Date.2 Name.2 Surname.2 Sex.2      DOB.2    Weight
1.1   1 2014-02-01   Dave     Evans  Male 1988-02-14   6 2014-02-05   Dave     Evans  Male 1988-02-14 20.876240
1     1 2014-02-01   Dave     Evans  Male 1988-02-14   3 2014-02-02  David     Evans  Male 1988-02-14 10.208543
3     3 2014-02-02  David     Evans  Male 1988-02-14   6 2014-02-05   Dave     Evans  Male 1988-02-14 10.208543
2     2 2014-02-02    Joe     Miles  Male 1987-05-01   4 2014-02-03 Joseph     Myles  Male 1987-05-01  9.886615

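However, what I want to achieve is to merge all the matches, so that I end up with just one row per person, with that person's reports laid out in order of report date. Something like this: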

        Date   Name Surname    Sex        DOB      Date2  Name2 Surname2 Sex2       DOB2      Date3 Name3 Surname3 Sex3       DOB3
1 2014-02-01   Dave   Evans   Male 1988-02-14 2014-02-02  David    Evans Male 1988-02-14 2014-02-05  Dave    Evans Male 1988-02-14
2 2014-02-02    Joe   Miles   Male 1987-05-01 2014-02-03 Joseph    Myles Male 1987-05-01
3 2014-02-04     Jo     Doe Female 1999-02-04         NA     NA       NA
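
One possible direction, offered as a rough sketch rather than a tested solution: treat each linked pair as an edge between two incident rows, give every group of connected rows a shared person id, and then reshape to one row per person. The sketch uses base R only. The id pairs are copied by hand from the getPairs output shown earlier (with the real objects they would presumably come from links$id1 and links$id2), and cluster_ids, person and report are hypothetical names introduced purely for illustration.

```r
# Dummy data from the question, rebuilt here so the sketch runs on its own
incidents <- data.frame(
  Date = as.Date(c("2014-02-01", "2014-02-02", "2014-02-02",
                   "2014-02-03", "2014-02-04", "2014-02-05")),
  Name = c("Dave", "Joe", "David", "Joseph", "Jo", "Dave"),
  Surname = c("Evans", "Miles", "Evans", "Myles", "Doe", "Evans"),
  Sex = c("Male", "Male", "Male", "Male", "Female", "Male"),
  DOB = as.Date(c("1988-02-14", "1987-05-01", "1988-02-14",
                  "1987-05-01", "1999-02-04", "1988-02-14")),
  stringsAsFactors = FALSE
)

# Assumption: linked row ids typed in from the getPairs() output in the question
id1 <- c(1L, 1L, 3L, 2L)
id2 <- c(6L, 3L, 6L, 4L)

# Hypothetical helper: every row starts with its own label; for each linked
# pair both rows take the smaller label, repeated until nothing changes.
# Each connected group ends up labelled with its smallest row number.
cluster_ids <- function(n, from, to) {
  label <- seq_len(n)
  repeat {
    changed <- FALSE
    for (k in seq_along(from)) {
      lo <- min(label[from[k]], label[to[k]])
      if (label[from[k]] != lo || label[to[k]] != lo) {
        label[c(from[k], to[k])] <- lo
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  label
}

# Attach the person id, sort by person and report date,
# and number each person's reports 1, 2, 3, ...
incidents$person <- cluster_ids(nrow(incidents), id1, id2)
incidents <- incidents[order(incidents$person, incidents$Date), ]
incidents$report <- ave(seq_along(incidents$person), incidents$person,
                        FUN = seq_along)

# One row per person; columns come out as Date1, Name1, ..., Date2, Name2, ...
persons <- reshape(incidents, idvar = "person", timevar = "report",
                   direction = "wide", sep = "")
rownames(persons) <- NULL
persons
```

For the dummy data the labels come out as 1, 2, 1, 2, 5, 1, so the three Evans reports share one row, the two Miles/Myles reports share another, and Jo Doe gets a row whose second and third report columns are NA. Unlike the layout above, the first report's columns carry a 1 suffix (Date1, Name1, and so on), which could be renamed afterwards. A graph package could take over the clustering step; base R is used here only to keep the sketch self-contained.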

Does anyone have suggestions for how to achieve this?

Thanks!

0 answers:

No answers yet