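This is a slight variation of a question previously answered on SO (Unique on a dataframe with only selected columns).
The only difference between that question and mine is that I have to specify which particular rows among the duplicates should be kept. I was thinking of using the row names, for example giving a substring and deleting or keeping the rows that contain it, but I could not put that into code. For example: if the duplicated rows are exm123 and tre123, I want to keep the one with the substring tre.
If you think there is a simpler way to do the same thing in R without relying on a substring, I would be glad to learn the alternative. Thanks.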
dat:
Index Name id1 id2
1 exm-9980 1 202183358
2 exm-53487 1 203186865
3 exm-tre10248 1 85537661
4 exm-7747 10 102827758
5 exm-29639 10 18289634
6 exm-76467 10 27436462
7 exm-tre7540 10 18289634
8 exm-4560589 10 74890584
9 vg-194357 11 102589148
10 exm-0867390 11 61110815
11 exm-IN3127 1 85537661
12 exm-tre2315 11 18632984
13 exm-12411 6 30332555
14 exm-128711 11 18632984
nm1 <- c('id1', 'id2')
indx <- duplicated(dat[,nm1])|duplicated(dat[,nm1],fromLast=TRUE)
df22=dat[!indx|(indx & grepl("^tre", dat$Name)),]
which(indx==T)
indx: 3, 5, 7, 12, 14, 11, 13
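When I cross-check against the main data using the id1 and id2 values from index 13:
f1 = dat[dat$id1 == 6 & dat$id2 == 30332555, ]
f1 has only 1 row. If that row were a duplicate, f1 should have 2 or more rows.
I can't post the full data because it has more than 100k rows, but I hope this shows the problem clearly.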
Answer 0 (score: 0)
Using the example dataset:
nm1 <- c('id1', 'id2')
indx <- duplicated(dat[,nm1])|duplicated(dat[,nm1], fromLast=TRUE)
dat[!indx|(indx & grepl("^tre", dat$Name)),]
# Index Name id1 id2
#1 1 exm-49980 1 2021
#2 2 exm-3487 1 20318
#3 3 exm-0248 1 8553
#4 4 exm-17747 10 102827
#5 5 exm-29639 10 18289
#7 7 tre-2987 10 27436
#8 8 vg-18999 18 279990
Data:
dat <- structure(list(Index = 1:8, Name = c("exm-49980", "exm-3487",
"exm-0248", "exm-17747", "exm-29639", "exm-6467", "tre-2987",
"vg-18999"), id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 10L, 18L), id2 = c(2021L,
20318L, 8553L, 102827L, 18289L, 27436L, 27436L, 279990L)), .Names = c("Index",
"Name", "id1", "id2"), class = "data.frame", row.names = c(NA,
-8L))
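To see why the two `duplicated()` calls are combined, here is a minimal sketch on the same 8-row data. It splits the mask into its two halves and cross-checks the result by counting rows per (id1, id2) pair. The names `later_copies`, `earlier_copies`, `key` and `group_size` are introduced here for illustration only and are not part of the code above.

```r
# The 8-row example data, rebuilt with data.frame() so the block runs on its own.
dat <- data.frame(
  Index = 1:8,
  Name = c("exm-49980", "exm-3487", "exm-0248", "exm-17747",
           "exm-29639", "exm-6467", "tre-2987", "vg-18999"),
  id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 10L, 18L),
  id2 = c(2021L, 20318L, 8553L, 102827L, 18289L, 27436L, 27436L, 279990L),
  stringsAsFactors = FALSE
)
nm1 <- c("id1", "id2")

# duplicated() on its own flags only the second and later copies of a pair.
later_copies <- duplicated(dat[, nm1])
# Scanning from the end flags the earlier copies instead.
earlier_copies <- duplicated(dat[, nm1], fromLast = TRUE)
# The union marks every row whose (id1, id2) pair occurs more than once.
indx <- later_copies | earlier_copies

# Cross-check: count how many rows share each (id1, id2) pair.
key <- paste(dat$id1, dat$id2, sep = "_")
group_size <- ave(seq_len(nrow(dat)), key, FUN = length)

which(indx)            # rows flagged by the mask
which(group_size > 1)  # rows whose pair occurs at least twice
```

Both `which()` calls should report the same rows. If they disagree on the full data, the columns passed to `duplicated()` are probably not the ones being compared by hand.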
For the updated example, the pattern passed to `grepl` has to change, because `Name` no longer starts with `tre`:
nm1 <- c('id1', 'id2')
indx <- duplicated(dat[,nm1])|duplicated(dat[,nm1], fromLast=TRUE)
dat1 <- dat[!indx|(indx & grepl("-tre", dat$Name)),] # "-tre" matches `tre` directly after the hyphen
dat1
# Index Name id1 id2
#1 1 exm-9980 1 202183358
#2 2 exm-53487 1 203186865
#3 3 exm-tre10248 1 85537661
#4 4 exm-7747 10 102827758
#6 6 exm-76467 10 27436462
#7 7 exm-tre7540 10 18289634
#8 8 exm-4560589 10 74890584
#9 9 vg-194357 11 102589148
#10 10 exm-0867390 11 61110815
#12 12 exm-tre2315 11 18632984
#13 13 exm-12411 6 30332555
Data:
dat <- structure(list(Index = 1:14, Name = c("exm-9980", "exm-53487",
"exm-tre10248", "exm-7747", "exm-29639", "exm-76467", "exm-tre7540",
"exm-4560589", "vg-194357", "exm-0867390", "exm-IN3127", "exm-tre2315",
"exm-12411", "exm-128711"), id1 = c(1L, 1L, 1L, 10L, 10L, 10L,
10L, 10L, 11L, 11L, 1L, 11L, 6L, 11L), id2 = c(202183358L, 203186865L,
85537661L, 102827758L, 18289634L, 27436462L, 18289634L, 74890584L,
102589148L, 61110815L, 85537661L, 18632984L, 30332555L, 18632984L
)), .Names = c("Index", "Name", "id1", "id2"), class = "data.frame", row.names = c(NA,
-14L))
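One caveat with the filter above: if a duplicated (id1, id2) pair has no row matching the pattern, every row of that pair is dropped. The following is a rough sketch of one way to fall back to the first row in that case. `keep_preferred()` and the `toy` data frame are hypothetical and made up for this illustration; they do not come from the question's data.

```r
# Hypothetical helper: keep unique rows, keep pattern matches inside duplicated
# groups, and fall back to the first row of a group when nothing in it matches.
keep_preferred <- function(dat, keys, pattern) {
  key <- do.call(paste, c(dat[keys], sep = "_"))
  in_dup_group <- duplicated(key) | duplicated(key, fromLast = TRUE)
  matches <- grepl(pattern, dat$Name)
  # TRUE for every row whose group contains at least one matching name.
  group_has_match <- as.logical(ave(matches, key, FUN = any))
  first_in_group <- !duplicated(key)
  keep <- !in_dup_group |
    (in_dup_group & matches) |
    (in_dup_group & !group_has_match & first_in_group)
  dat[keep, , drop = FALSE]
}

# Toy data (assumed, for illustration): one duplicated pair with a "tre" row,
# one duplicated pair without any.
toy <- data.frame(
  Name = c("exm-1", "exm-tre2", "exm-3", "exm-4", "exm-5"),
  id1 = c(1L, 1L, 1L, 10L, 10L),
  id2 = c(100L, 200L, 200L, 300L, 300L),
  stringsAsFactors = FALSE
)

result <- keep_preferred(toy, c("id1", "id2"), "-tre")
result$Name
# "exm-1" "exm-tre2" "exm-4"
```

On the 14-row example every duplicated pair contains a `tre` row, so this helper and the one-line filter should select the same rows there; the fallback only matters for pairs without a match.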