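I want to identify duplicate records in my dataset based on multiple columns, review them, and keep the records with the most complete data, using R. For each name, I want to keep the row (or rows) with the largest number of populated fields. For the date column, I also want to treat invalid dates as missing. My data looks like this: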
df <- data.frame(Record = c(1, 2, 3, 4, 5),
                 First = c("Ed", "Sue", "Ed", "Sue", "Ed"),
                 Last = c("Bee", "Cord", "Bee", "Cord", "Bee"),
                 Address = c(123, NA, NA, 456, 789),
                 DOB = c("12/6/1995", "0056/12/5", NA, "12/5/1956", "10/4/1980"))
Record  First  Last  Address  DOB
1       Ed     Bee   123      12/6/1995
2       Sue    Cord  NA       0056/12/5
3       Ed     Bee   NA       NA
4       Sue    Cord  456      12/5/1956
5       Ed     Bee   789      10/4/1980
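So in this case I would keep records 1, 4, and 5. There are about 85,000 records and 130 variables, so if there is a systematic way to do this, I would appreciate the help. Also, I am an R newbie (as if you couldn't tell), so any explanation is appreciated too. Thanks!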
Answer 0 (score: 0)
# Add a new column to the data frame holding the number of NA values in each row.
# is.na(df) gives a TRUE/FALSE matrix; rowSums() counts the TRUEs per row.
df$nMissing <- rowSums(is.na(df))
# Using ave(), compute the smallest nMissing value within each name, then keep
# the rows whose own nMissing equals that minimum (ties are all kept).
deduped_df <- df[which(df$nMissing == ave(df$nMissing, paste(df$First, df$Last), FUN = min)), ]
# If you like, remove the nMissing column from both data frames.
df$nMissing <- NULL
deduped_df$nMissing <- NULL
deduped_df
  Record First Last Address       DOB
1      1    Ed  Bee     123 12/6/1995
4      4   Sue Cord     456 12/5/1956
5      5    Ed  Bee     789 10/4/1980
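Since an explanation was requested, here is a minimal sketch that breaks the `ave()` step into its intermediate values. The names `n_missing`, `person`, and `group_min` are introduced only for illustration and are not part of the answer above.

```r
# Rebuild the example data from the question.
df <- data.frame(Record = c(1, 2, 3, 4, 5),
                 First = c("Ed", "Sue", "Ed", "Sue", "Ed"),
                 Last = c("Bee", "Cord", "Bee", "Cord", "Bee"),
                 Address = c(123, NA, NA, 456, 789),
                 DOB = c("12/6/1995", "0056/12/5", NA, "12/5/1956", "10/4/1980"))

# Number of missing values in each row (illustrative name).
n_missing <- unname(rowSums(is.na(df)))

# One grouping key per person, built from first and last name.
person <- paste(df$First, df$Last)

# ave() returns a vector as long as its input: each element is replaced by
# FUN applied to that element's group, here the minimum for that person.
group_min <- ave(n_missing, person, FUN = min)

# Side-by-side view of the intermediate values and the resulting filter.
steps <- data.frame(Record = df$Record, person, n_missing, group_min,
                    keep = n_missing == group_min)
print(steps)
```

Rows where `keep` is `TRUE` are the ones that survive the filter. Because the comparison is against the group minimum, tied rows such as records 1 and 5 are both kept.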
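Edit: Based on your comment, if you also want to filter out invalid DOBs, you can convert the column to Date format, which turns dates that do not match the given format into NA (missing data). Run this conversion before computing nMissing so that those dates are counted as missing.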
df$DOB<-as.Date(df$DOB,format="%m/%d/%Y")
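Putting the pieces in order, a sketch of the whole procedure might look like the following. It assumes every valid date uses the month/day/four-digit-year layout from the example. The 1900 cutoff is a hypothetical sanity check added here, since `%Y` can accept a short year such as `56` as the year 56 rather than failing; adjust or drop it to suit your data.

```r
# Rebuild the example data from the question.
df <- data.frame(Record = c(1, 2, 3, 4, 5),
                 First = c("Ed", "Sue", "Ed", "Sue", "Ed"),
                 Last = c("Bee", "Cord", "Bee", "Cord", "Bee"),
                 Address = c(123, NA, NA, 456, 789),
                 DOB = c("12/6/1995", "0056/12/5", NA, "12/5/1956", "10/4/1980"))

# 1. Convert dates first, so values that do not parse with this format
#    are already NA when the missing values are counted.
df$DOB <- as.Date(df$DOB, format = "%m/%d/%Y")

# Optional, hypothetical cutoff: also treat implausibly old dates as missing.
too_old <- !is.na(df$DOB) & df$DOB < as.Date("1900-01-01")
df$DOB[too_old] <- NA

# 2. Count the missing values per row.
n_missing <- rowSums(is.na(df))

# 3. Keep the rows whose count equals the minimum for that person.
keep <- n_missing == ave(n_missing, paste(df$First, df$Last), FUN = min)
deduped_df <- df[keep, ]
print(deduped_df)
```

Counting with `rowSums(is.na(df))` works on the whole data frame at once, so it should scale to roughly 85,000 rows and 130 columns without a row-by-row loop.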