Picking specific rows from duplicates in a data frame, using only selected columns

Time: 2014-10-15 03:18:01

Tags: r unique

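This is a slight variation of a question previously answered on SO ("Unique on a dataframe with only selected columns").

The only difference between that question and mine is that I have to specify which particular rows should be kept among the duplicates. My rows are identified by names, so I was thinking of passing a substring and removing rows based on whether the name contains it, but I couldn't put that into code. For example: if the duplicated rows are exm123 and tre123, I want to keep the ones with the substring tre.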

If you think there is an easier way to do the same thing in R without using a substring, I'd be happy to learn the alternative. Thanks.

  dat:    
  Index Name      id1   id2
  1 exm-9980        1   202183358
  2 exm-53487       1   203186865
  3 exm-tre10248    1   85537661
  4 exm-7747       10   102827758
  5 exm-29639      10   18289634
  6 exm-76467      10   27436462
  7 exm-tre7540    10   18289634
  8 exm-4560589    10   74890584
  9 vg-194357      11   102589148
  10 exm-0867390   11   61110815
  11 exm-IN3127     1   85537661
  12 exm-tre2315   11   18632984
  13 exm-12411      6   30332555
  14 exm-128711    11   18632984

nm1 <- c('id1', 'id2')           
indx <- duplicated(dat[,nm1])|duplicated(dat[,nm1],fromLast=TRUE)    
df22=dat[!indx|(indx & grepl("^tre", dat$Name)),]    
which(indx==T)       

indx: 3,5,7,12,14,11,13

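When I cross-check against the main data using the id1 and id2 values from index 13:

f1 = dat[dat$id1 == 6 & dat$id2 == 30332555, ]

f1 is a matrix with 1 row. If the row were a duplicate, it should be a matrix with 2 or more rows.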
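
The cross-check described above can be run for every flagged row at once. What follows is only a minimal sketch on the 14-row sample from this question; `key`, `group_size` and `flagged` are illustrative names that do not appear in the original post, and the full 100k-row data may behave differently.

```r
# Sample data copied from the question (14 rows).
dat <- data.frame(
  Index = 1:14,
  Name = c("exm-9980", "exm-53487", "exm-tre10248", "exm-7747", "exm-29639",
           "exm-76467", "exm-tre7540", "exm-4560589", "vg-194357",
           "exm-0867390", "exm-IN3127", "exm-tre2315", "exm-12411",
           "exm-128711"),
  id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 1L, 11L, 6L, 11L),
  id2 = c(202183358L, 203186865L, 85537661L, 102827758L, 18289634L,
          27436462L, 18289634L, 74890584L, 102589148L, 61110815L,
          85537661L, 18632984L, 30332555L, 18632984L),
  stringsAsFactors = FALSE
)

nm1 <- c("id1", "id2")
# TRUE for every row whose (id1, id2) pair occurs more than once.
indx <- duplicated(dat[, nm1]) | duplicated(dat[, nm1], fromLast = TRUE)

# Build one key per row and count how many rows share that key.
key <- paste(dat$id1, dat$id2, sep = "_")
group_size <- ave(seq_along(key), key, FUN = length)

flagged <- which(indx)
print(flagged)

# Every flagged row should sit in a group of two or more rows.
print(all(group_size[flagged] >= 2))
```

On this sample the flagged rows are 3, 5, 7, 11, 12 and 14, and row 13 is not among them, so an index of 13 in `indx` would have to come from rows that exist only in the full data.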

I can't post the full data because it has more than 100k rows, but I hope this shows the problem clearly.

1 Answer:

Answer 0: (score: 0)

Using the example dataset:

 nm1 <- c('id1', 'id2')
 indx <- duplicated(dat[,nm1])|duplicated(dat[,nm1], fromLast=TRUE)

dat[!indx|(indx & grepl("^tre", dat$Name)),] 
 #   Index      Name id1    id2
 #1     1 exm-49980   1   2021
 #2     2  exm-3487   1  20318
 #3     3  exm-0248   1   8553
 #4     4 exm-17747  10 102827
 #5     5 exm-29639  10  18289
 #7     7  tre-2987  10  27436
 #8     8  vg-18999  18 279990

Data

 dat <- structure(list(Index = 1:8, Name = c("exm-49980", "exm-3487", 
 "exm-0248", "exm-17747", "exm-29639", "exm-6467", "tre-2987", 
 "vg-18999"), id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 10L, 18L), id2 = c(2021L, 
 20318L, 8553L, 102827L, 18289L, 27436L, 27436L, 279990L)), .Names = c("Index", 
 "Name", "id1", "id2"), class = "data.frame", row.names = c(NA, 
-8L))

Update

 nm1 <- c('id1', 'id2')
 indx <- duplicated(dat[,nm1])|duplicated(dat[,nm1], fromLast=TRUE)

 dat1 <- dat[!indx|(indx&grepl("-tre", dat$Name)),] #check the `grepl`.  The pattern is changed as per the new example.  Here, the `Name` no longer starts with `tre`.
 dat1
 #    Index         Name id1       id2
 #1      1     exm-9980   1 202183358
 #2      2    exm-53487   1 203186865
 #3      3 exm-tre10248   1  85537661
 #4      4     exm-7747  10 102827758
 #6      6    exm-76467  10  27436462
 #7      7  exm-tre7540  10  18289634
 #8      8  exm-4560589  10  74890584
 #9      9    vg-194357  11 102589148
 #10    10  exm-0867390  11  61110815
 #12    12  exm-tre2315  11  18632984
 #13    13    exm-12411   6  30332555
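
To see why the pattern had to change, here is a small sketch comparing the two `grepl` patterns on three names taken from the examples in this thread. The vector `nms` and the result names are only for illustration.

```r
# Three names from the two example datasets.
nms <- c("tre-2987", "exm-tre10248", "exm-29639")

# "^tre" only matches names that START with "tre".
starts_with_tre <- grepl("^tre", nms)
print(starts_with_tre)   # TRUE FALSE FALSE

# "-tre" matches "tre" directly after a hyphen, anywhere in the name.
hyphen_tre <- grepl("-tre", nms)
print(hyphen_tre)        # FALSE TRUE FALSE

# A fixed (non-regex) match on "tre" covers both naming styles.
any_tre <- grepl("tre", nms, fixed = TRUE)
print(any_tre)           # TRUE TRUE FALSE
```

If the real names mix both styles, the fixed match on `"tre"` may be the safer choice, provided no unrelated name happens to contain those three letters.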

Data

 dat <- structure(list(Index = 1:14, Name = c("exm-9980", "exm-53487", 
 "exm-tre10248", "exm-7747", "exm-29639", "exm-76467", "exm-tre7540", 
 "exm-4560589", "vg-194357", "exm-0867390", "exm-IN3127", "exm-tre2315", 
 "exm-12411", "exm-128711"), id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 
 10L, 10L, 11L, 11L, 1L, 11L, 6L, 11L), id2 = c(202183358L, 203186865L, 
 85537661L, 102827758L, 18289634L, 27436462L, 18289634L, 74890584L, 
 102589148L, 61110815L, 85537661L, 18632984L, 30332555L, 18632984L
 )), .Names = c("Index", "Name", "id1", "id2"), class = "data.frame", row.names = c(NA, 
 -14L))
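
As a possible alternative (a different technique from the mask used above, sketched here under the assumption that exactly one row per `id1`/`id2` pair should survive): sort the rows so that names containing `-tre` come first, then drop later duplicates with `!duplicated()`. The names `is_tre`, `ord`, `sorted` and `kept` are hypothetical helpers, not part of the original answer.

```r
# Sample data from the updated example (14 rows).
dat <- data.frame(
  Index = 1:14,
  Name = c("exm-9980", "exm-53487", "exm-tre10248", "exm-7747", "exm-29639",
           "exm-76467", "exm-tre7540", "exm-4560589", "vg-194357",
           "exm-0867390", "exm-IN3127", "exm-tre2315", "exm-12411",
           "exm-128711"),
  id1 = c(1L, 1L, 1L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 1L, 11L, 6L, 11L),
  id2 = c(202183358L, 203186865L, 85537661L, 102827758L, 18289634L,
          27436462L, 18289634L, 74890584L, 102589148L, 61110815L,
          85537661L, 18632984L, 30332555L, 18632984L),
  stringsAsFactors = FALSE
)

nm1 <- c("id1", "id2")

# Mark the preferred rows (names containing "-tre").
is_tre <- grepl("-tre", dat$Name, fixed = TRUE)

# Put preferred rows first; the row number breaks ties explicitly.
ord <- order(!is_tre, seq_len(nrow(dat)))
sorted <- dat[ord, ]

# Keep the first row of each (id1, id2) group, then restore the original order.
kept <- sorted[!duplicated(sorted[, nm1]), ]
kept <- kept[order(kept$Index), ]
print(kept)
```

On this sample the result has the same rows as `dat1` above. The behavior differs in edge cases: a duplicated group with no `-tre` name still keeps its first row instead of losing all of its rows, and a group with several `-tre` names keeps only the first of them.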