Removing redundant records that are not perfect duplicates

Asked: 2019-05-01 14:16:44

Tags: r duplicates data-manipulation

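I need to remove redundant records from a file, but these redundant records do not look like standard duplicates. The object have is a data frame containing the number of school projects that characters from the TV show Recess worked on together. It has 7,000 observations.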

head(have)

obs authA           authB            n_projects
1   TJ.DETWEILER    GRETCHEN.WILSON          11
2   TJ.DETWEILER    KING.BOB                  2
3   TJ.DETWEILER    ASHLEY.SPINELLI           1
4   TJ.DETWEILER    VINCE.LASALLE             3
5   GRETCHEN.WILSON TJ.DETWEILER             11
6   GRETCHEN.WILSON ASHLEY.SPINELLI           7
…   …               …                         …

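One redundant record is visible above: the first observation carries the same information as the fifth. The order of the authors (that is, who is listed as authA and who as authB) does not matter. I need to remove one of these two observations; it does not matter which. The new data frame want might look like this: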

obs authA           authB            n_projects
1   TJ.DETWEILER    GRETCHEN.WILSON          11
2   TJ.DETWEILER    KING.BOB                  2
3   TJ.DETWEILER    ASHLEY.SPINELLI           1
4   TJ.DETWEILER    VINCE.LASALLE             3
6   GRETCHEN.WILSON ASHLEY.SPINELLI           7
…   …               …                         …

Removing the first observation instead would be fine too.

1 Answer:

Answer 0: (score: 2)

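Subset the two author columns ("authA", "authB"), loop over the rows and sort each one so that the pair is in the same order regardless of which column a name came from, then apply duplicated to get a logical vector and use that vector to drop the repeated rows: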

have[!duplicated(t(apply(have[2:3], 1, sort))),]
#  obs           authA           authB n_projects
#1   1    TJ.DETWEILER GRETCHEN.WILSON         11
#2   2    TJ.DETWEILER        KING.BOB          2
#3   3    TJ.DETWEILER ASHLEY.SPINELLI          1
#4   4    TJ.DETWEILER   VINCE.LASALLE          3
#6   6 GRETCHEN.WILSON ASHLEY.SPINELLI          7
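
To see why the one-liner works, it can help to unpack it into separate steps. The sketch below rebuilds a small copy of the sample data so it runs on its own; the intermediate names pairs, keys and is_dup are illustrative and not part of the original answer.

```r
# Small copy of the sample data so this block is self-contained
have <- data.frame(
  obs = 1:6,
  authA = c("TJ.DETWEILER", "TJ.DETWEILER", "TJ.DETWEILER",
            "TJ.DETWEILER", "GRETCHEN.WILSON", "GRETCHEN.WILSON"),
  authB = c("GRETCHEN.WILSON", "KING.BOB", "ASHLEY.SPINELLI",
            "VINCE.LASALLE", "TJ.DETWEILER", "ASHLEY.SPINELLI"),
  n_projects = c(11L, 2L, 1L, 3L, 11L, 7L),
  stringsAsFactors = FALSE
)

# Step 1: keep only the two author columns
pairs <- have[c("authA", "authB")]

# Step 2: sort the two names within each row; apply() returns one
# column per input row, so t() turns the result back into one row per pair
keys <- t(apply(pairs, 1, sort))

# Step 3: duplicated() on a matrix compares whole rows and flags
# every row that repeats an earlier one
is_dup <- duplicated(keys)

# Step 4: keep only the rows that were not flagged
want <- have[!is_dup, ]
```

Because duplicated flags the later occurrence, this keeps observation 1 and drops observation 5; passing fromLast = TRUE to duplicated would keep the later one instead.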

Or an option with pmin/pmax, which builds an order-independent key for each pair without looping over rows:

library(dplyr)
library(stringr)
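have %>% 
   # pmin/pmax put each pair in a fixed order; the separator keeps
   # different pairs from pasting together into the same key
   filter(!duplicated(str_c(pmin(authA, authB), pmax(authA, authB), sep = "|")))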
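
If adding dplyr and stringr is not desirable, the same pmin/pmax idea can be sketched in base R. This is a minimal variant, not the answerer's code: it assumes authA and authB are character vectors (not factors), and the names first and second are illustrative. Passing the two ordered columns to duplicated as a data frame sidesteps the need to paste them into a single string.

```r
# Small copy of the sample data so this block is self-contained
have <- data.frame(
  obs = 1:6,
  authA = c("TJ.DETWEILER", "TJ.DETWEILER", "TJ.DETWEILER",
            "TJ.DETWEILER", "GRETCHEN.WILSON", "GRETCHEN.WILSON"),
  authB = c("GRETCHEN.WILSON", "KING.BOB", "ASHLEY.SPINELLI",
            "VINCE.LASALLE", "TJ.DETWEILER", "ASHLEY.SPINELLI"),
  n_projects = c(11L, 2L, 1L, 3L, 11L, 7L),
  stringsAsFactors = FALSE
)

# Element-wise "smaller" and "larger" name of each pair,
# so (A, B) and (B, A) produce the same two values
first  <- pmin(have$authA, have$authB)
second <- pmax(have$authA, have$authB)

# duplicated() on a data frame compares whole rows
want <- have[!duplicated(data.frame(first, second)), ]
```

Since pmin and pmax are vectorized, this avoids the row-by-row apply call, which may matter more as the number of observations grows.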

Data

have <- structure(list(obs = 1:6, authA = c("TJ.DETWEILER", "TJ.DETWEILER", 
"TJ.DETWEILER", "TJ.DETWEILER", "GRETCHEN.WILSON", "GRETCHEN.WILSON"
), authB = c("GRETCHEN.WILSON", "KING.BOB", "ASHLEY.SPINELLI", 
"VINCE.LASALLE", "TJ.DETWEILER", "ASHLEY.SPINELLI"), n_projects = c(11L, 
2L, 1L, 3L, 11L, 7L)), class = "data.frame", row.names = c(NA, 
-6L))