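I need to remove redundant records from a file, but these redundant records don't look like standard duplicates. The object have is a data frame containing the number of school projects that characters from the TV show Recess worked on together. It has 7,000 observations.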
head(have)
obs authA authB n_projects
1 TJ.DETWEILER GRETCHEN.WILSON 11
2 TJ.DETWEILER KING.BOB 2
3 TJ.DETWEILER ASHLEY.SPINELLI 1
4 TJ.DETWEILER VINCE.LASALLE 3
5 GRETCHEN.WILSON TJ.DETWEILER 11
6 GRETCHEN.WILSON ASHLEY.SPINELLI 7
… … … …
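This shows one redundant record: observation 1 contains the same information as observation 5. Author order (that is, whether an author is listed as authA or as authB) doesn't matter. I need to remove one of these two observations, and it doesn't matter which. The new data frame want might look like this: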
obs authA authB n_projects
1 TJ.DETWEILER GRETCHEN.WILSON 11
2 TJ.DETWEILER KING.BOB 2
3 TJ.DETWEILER ASHLEY.SPINELLI 1
4 TJ.DETWEILER VINCE.LASALLE 3
6 GRETCHEN.WILSON ASHLEY.SPINELLI 7
… … … …
Removing observation 1 instead would be fine too.
Answer 0 (score: 2)
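Subset the dataset to the columns authA and authB, loop over the rows and sort each one, then apply duplicated to create a logical vector, and use that logical vector to remove the duplicated rows: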
have[!duplicated(t(apply(have[2:3], 1, sort))),]
# obs authA authB n_projects
#1 1 TJ.DETWEILER GRETCHEN.WILSON 11
#2 2 TJ.DETWEILER KING.BOB 2
#3 3 TJ.DETWEILER ASHLEY.SPINELLI 1
#4 4 TJ.DETWEILER VINCE.LASALLE 3
#6 6 GRETCHEN.WILSON ASHLEY.SPINELLI 7
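To make the one-liner above easier to follow, here is a minimal sketch that splits it into its three steps. The intermediate names sorted_pairs and is_dup are illustrative only, and the data frame is rebuilt from the six rows shown in the question rather than the full 7,000-row data. It assumes both author columns are character and contain no missing values, since sort() drops NA by default.

```r
# Stand-in for `have`: the six rows shown in the question
have <- data.frame(
  obs = 1:6,
  authA = c("TJ.DETWEILER", "TJ.DETWEILER", "TJ.DETWEILER",
            "TJ.DETWEILER", "GRETCHEN.WILSON", "GRETCHEN.WILSON"),
  authB = c("GRETCHEN.WILSON", "KING.BOB", "ASHLEY.SPINELLI",
            "VINCE.LASALLE", "TJ.DETWEILER", "ASHLEY.SPINELLI"),
  n_projects = c(11L, 2L, 1L, 3L, 11L, 7L),
  stringsAsFactors = FALSE
)

# Step 1: sort the two names within each row.
# apply() returns one column per input row, so t() restores one row per pair.
sorted_pairs <- t(apply(have[c("authA", "authB")], 1, sort))

# Step 2: flag rows whose sorted pair already appeared in an earlier row
is_dup <- duplicated(sorted_pairs)

# Step 3: keep only the first occurrence of each unordered pair
want <- have[!is_dup, ]
print(want)
```

Because duplicated() marks later occurrences, this keeps observation 1 and drops observation 5; passing fromLast = TRUE to duplicated() would keep the later row instead.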
Or an option with pmin/pmax:
library(dplyr)
library(stringr)
have %>%
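  # sep keeps two different pairs from pasting to the same key; assumes "|" never occurs in a name
  filter(!duplicated(str_c(pmin(authA, authB), pmax(authA, authB), sep = "|")))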
# Sample data: the six rows of have shown in the question
have <- structure(list(obs = 1:6, authA = c("TJ.DETWEILER", "TJ.DETWEILER",
"TJ.DETWEILER", "TJ.DETWEILER", "GRETCHEN.WILSON", "GRETCHEN.WILSON"
), authB = c("GRETCHEN.WILSON", "KING.BOB", "ASHLEY.SPINELLI",
"VINCE.LASALLE", "TJ.DETWEILER", "ASHLEY.SPINELLI"), n_projects = c(11L,
2L, 1L, 3L, 11L, 7L)), class = "data.frame", row.names = c(NA,
-6L))
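As a variation on the pmin/pmax option above, the following sketch uses base R only and passes the two ordered columns to duplicated() as a data frame instead of pasting them into one string, so no separator has to be chosen. This is one possible variant, not the answer's own code; the names first, second and pair_key are illustrative, and it assumes authA and authB are character vectors (pmin/pmax do not work on unordered factors).

```r
# Stand-in for `have`: the six rows shown in the question
have <- data.frame(
  obs = 1:6,
  authA = c("TJ.DETWEILER", "TJ.DETWEILER", "TJ.DETWEILER",
            "TJ.DETWEILER", "GRETCHEN.WILSON", "GRETCHEN.WILSON"),
  authB = c("GRETCHEN.WILSON", "KING.BOB", "ASHLEY.SPINELLI",
            "VINCE.LASALLE", "TJ.DETWEILER", "ASHLEY.SPINELLI"),
  n_projects = c(11L, 2L, 1L, 3L, 11L, 7L),
  stringsAsFactors = FALSE
)

# Element-wise smaller and larger name of each pair,
# so (A, B) and (B, A) map to the same (first, second)
first  <- pmin(have$authA, have$authB)
second <- pmax(have$authA, have$authB)

# duplicated() on a data frame compares whole rows
pair_key <- data.frame(first, second, stringsAsFactors = FALSE)
want <- have[!duplicated(pair_key), ]
print(want)
```

Since pmin and pmax are vectorised, this avoids the row-by-row apply() call, which may matter more on the full 7,000-row data than on this small sample.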