Question

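I have a set of about 20 consecutive individual-level cross-sectional data sets that I would like to link together.

Unfortunately, there is no time-stable ID number. There are, however, fields for first, last and maiden names, as well as year of birth, which I think should allow for a fairly high (90-95%) match rate.

Ideally, I would create a time-independent ID for each unique individual.

For people whose marital status (and hence maiden name) does not change, I can do this fairly easily in R: stack the data sets to get a long panel, then do something to the effect of: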

unique(dt,by=c("first_name","last_name","birth_year"))[,id:=.I]

(I'm using R's data.table, of course), and then merge back into the full data.
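To make the stack / de-duplicate / merge-back idea concrete, here is a minimal sketch on toy data. The two yearly tables and their values are made up for illustration; only the column names come from the preview below.

```r
library(data.table)

# Toy stand-ins for two yearly cross-sections (hypothetical values)
y2002 <- data.table(first_name = c("eileen", "jill"),
                    last_name  = c("aaldxxxx", "zydoxxxx"),
                    birth_year = c(1977L, 1978L), year = 2002L)
y2003 <- data.table(first_name = c("eileen", "sarah"),
                    last_name  = c("aaldxxxx", "aaxxxx"),
                    birth_year = c(1977L, 1974L), year = 2003L)

# Stack the cross-sections into one long panel
dt <- rbindlist(list(y2002, y2003), use.names = TRUE)

key_cols <- c("first_name", "last_name", "birth_year")

# One row per distinct key combination, numbered in order of first appearance
ids <- unique(dt, by = key_cols)[, id := .I][, c(key_cols, "id"), with = FALSE]

# Merge the IDs back onto every row of the panel (update join)
dt[ids, on = key_cols, id := i.id]
```

The same result should be obtainable in one step with `dt[, id := .GRP, by = key_cols]`, which skips the separate lookup table.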

However, I'm stuck on how to incorporate the maiden name into this procedure. Any suggestions?
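One possible direction, sketched under a strong assumption: if `nee` is missing or empty for people who have not changed their surname, then a "birth surname" column can serve as a key that survives marriage. The rows below are hypothetical, and `birth_surname` is a name invented for this illustration.

```r
library(data.table)

# Hypothetical rows: "linda" appears before and after a surname change
dt <- data.table(
  first_name = c("linda", "linda", "kelly"),
  last_name  = c("aarxxxx", "aarxxxx-gxxxx", "aaxxxx"),
  nee        = c(NA, "aarxxxx", "nxxxx"),
  birth_year = c(1967L, 1967L, 1951L),
  year       = c(2007L, 2008L, 2008L)
)

# Assumption: a missing or empty `nee` means last_name is still the birth surname
dt[, birth_surname := ifelse(is.na(nee) | nee == "", last_name, nee)]

# Group on the marriage-invariant key instead of last_name
dt[, id := .GRP, by = .(first_name, birth_surname, birth_year)]
```

This would not help where `nee` is simply unrecorded for someone who did change names, so it is a starting point rather than a complete answer.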

Here's a preview of the data:

       first_name     last_name       nee birth_year year
    1:     eileen      aaldxxxx     dxxxx       1977 2002
    2:     eileen      aaldxxxx     dxxxx       1977 2003
    3:      sarah        aaxxxx    gexxxx       1974 2003
    4:      kelly        aaxxxx     nxxxx       1951 2008
    5:      linda aarxxxx-gxxxx   aarxxxx       1967 2008
   ---                                                   
72008:     stacey      zwirxxxx   kruxxxx       1982 2010
72009:     stacey      zwirxxxx   kruxxxx       1982 2011
72010:     stacey      zwirxxxx   kruxxxx       1982 2012
72011:     stacey      zwirxxxx   kruxxxx       1982 2013
72012:       jill      zydoxxxx gundexxxx       1978 2002

Update:

I've done a lot of hacking and hammering away at the problem; here's what I've got so far. I'd appreciate any comments on how the code could be improved.

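I'm still missing 3-5% of matches entirely because of inexact spellings ("tonya" vs. "tanya", "jenifer" vs. "jennifer"). I haven't come up with a clean way to fuzzy-match the stragglers, so if anyone has a straightforward way to implement that, there is room for better matching in that direction.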
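For the stragglers, one option that needs no extra packages is edit distance via base R's `adist`. The sketch below uses the two name pairs mentioned above plus one unrelated name; the threshold `max_dist` is a hypothetical choice, not something tuned on the real data.

```r
# Names from the text plus one extra, clearly different, name
unmatched <- c("tonya", "jenifer", "stacey")
known     <- c("tanya", "jennifer", "sarah")

# Levenshtein distance between every unmatched/known pair (utils::adist)
d <- adist(unmatched, known)
dimnames(d) <- list(unmatched, known)

# Accept the nearest known name only if it is within max_dist edits
max_dist <- 1L                               # hypothetical threshold
nearest  <- apply(d, 1, which.min)           # column index of the closest name
best     <- d[cbind(seq_along(unmatched), nearest)]
fuzzy    <- ifelse(best <= max_dist, known[nearest], NA_character_)
```

In practice it would probably be safer to compare names only within groups that already agree exactly on other fields (for example `birth_year` and last name), so that a one-letter difference cannot link two unrelated people.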

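The basic approach is to build up cumulatively: assign IDs in the first year, then look for matches in the second year and assign new IDs to anyone unmatched. Then in year 3, look back at the first 2 years, and so on. As for how to match, the idea is to loosen the matching criteria gradually, since the stricter the match, the lower the chance of an accidental mismatch (I'm particularly worried about the John Smiths).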
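Stripped of the real data's details, the cumulative scheme can be sketched as follows. This is a simplified illustration, not the function given below: the toy rows, the column `person_id` and the two key sets are all assumptions made for the example.

```r
library(data.table)

# Toy panel (hypothetical values); "sarah" changes last name between years
panel <- data.table(
  first_name = c("eileen", "sarah", "eileen", "sarah", "kelly"),
  last_name  = c("aaldxxxx", "gexxxx", "aaldxxxx", "aaxxxx", "aaxxxx"),
  birth_year = c(1977L, 1974L, 1977L, 1974L, 1951L),
  year       = c(2002L, 2002L, 2003L, 2003L, 2003L),
  person_id  = NA_integer_
)

# Key sets ordered from strictest to loosest (illustrative only)
key_sets <- list(c("first_name", "last_name", "birth_year"),
                 c("first_name", "birth_year"))

years <- sort(unique(panel$year))
# First year: every row gets a fresh ID
panel[year == years[1], person_id := seq_len(.N)]

for (yr in years[-1]) {
  for (keys in key_sets) {
    # Rows of the current year that still lack an ID
    cur   <- panel[, which(year == yr & is.na(person_id))]
    # IDs already used in the current year must not be assigned twice
    taken <- panel[year == yr & !is.na(person_id), person_id]
    # Most recent prior row of each still-available person ...
    prior <- panel[year < yr & !person_id %in% taken][
      order(year), .SD[.N], by = person_id]
    # ... kept only if its key is unique under the current key set
    prior <- prior[, if (.N == 1L) .SD, by = keys]
    if (!length(cur) || !nrow(prior)) next
    # Look up each unmatched row's key among the prior observations
    m <- prior[panel[cur, keys, with = FALSE], on = keys, person_id]
    panel[cur, person_id := m]
  }
  # Whoever is still unmatched gets a new ID
  n_max <- panel[, max(person_id, na.rm = TRUE)]
  panel[year == yr & is.na(person_id), person_id := n_max + seq_len(.N)]
}
```

Here "eileen" is linked by the strict key, "sarah" only by the looser one, and "kelly" receives a new ID. The sketch does not guard against two current-year rows sharing the same key, which the real data would need.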

Without further ado, here's the main function for matching a pair of data sets:

#Match still-unmatched rows of year yr to earlier observations.
#  Relies on objects defined elsewhere: full_data (keyed by year)
#  and the character vectors flags and update_cols.
get_id<-function(yr,key_from,key_to=key_from,
                 mdis,msch,mard,init,mexp,step){
  #Want to exclude anyone who is already matched in year yr
  existing_ids<-full_data[.(yr),unique(na.omit(teacher_id))]
  #Get the most recent prior observation of all
  #  unmatched teachers, excluding those teachers
  #  who cannot be uniquely identified by the
  #  current key setting
  unmatched<-
    full_data[.(1996:(yr-1))
              ][!teacher_id %in% existing_ids,
                .SD[.N],by=teacher_id,
                .SDcols=c(key_from,"teacher_id")
                ][,if (.N==1L) .SD,keyby=key_from
                  ][,(flags):=list(mdis,msch,mard,init,mexp,step)]
  #Merge on key_to, then reset the key to year
  setkey(setkeyv(
    full_data,key_to)[year==yr&is.na(teacher_id),
                      (update_cols):=unmatched[.SD,update_cols,with=FALSE]],
    year)
  #Spread the first non-missing value within each id across its rows
  full_data[.(yr),(update_cols):=lapply(.SD,function(x)na.omit(x)[1]),
                                        by=id,.SDcols=update_cols]
}

I then basically loop over the 19 years yy in a for loop, running 12 progressively looser matches. For example, step 3 is:

get_id(yy,c("first_name_clean","last_name_clean","birth_year"),
       mdis=TRUE,msch=TRUE,mard=FALSE,init=FALSE,mexp=FALSE,step=3L)

And the final step is to assign new IDs:

current_max<-full_data[.(yy),max(teacher_id,na.rm=T)]
new_ids<-
  setkey(full_data[year==yy&is.na(teacher_id),.(id=unique(id))
                   ][,add_id:=.I+current_max],id)
setkey(setkey(full_data,id)[year==yy&is.na(teacher_id),
                            teacher_id:=new_ids[.SD,add_id]],year)

Using first, last and maiden name strings (and birth year) to match individuals over time

0 answers: