Remove duplicate combinations regardless of direction while keeping all columns

Time: 2017-07-18 12:30:01

Tags: r duplicates

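I need to remove duplicate combinations of two columns (feedID and feedID2) within a group (ID), while keeping a large number of other columns in the dataset. Duplicate rows should be removed irrespective of whether it is A in column 2 and B in column 3, or vice versa. In addition, I want to remove all rows where the same value, e.g. A, is present in both columns, and rows where either of the two columns contains NA. I cannot sort the data between columns, i.e. if A is in column 2, it should stay in column 2.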

I know this may be a duplicate question, but none of the other answers seem to work for my dataset or ask for quite the same thing. E.g. "Finding unique combinations irrespective of position" and "Removing duplicate combinations in R (irrespective of order)".

 test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
                      feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1",  "D2" ),
                    feedID2 = c("A1", "G2", "A1", "G2", "NA", "D1", "D2",  "NA" ))

 desiredoutput <- data.frame(ID= c("49V", "52V", "52V"),
                      feedID = c("A1","B1", "D1" ),
                    feedID2 = c("G2", "D1", "D2" ))

The following code does not remove duplicates if they are in different columns:
   test2 <- test [!duplicated(test[,c("ID","feedID", "feedID2")]),]
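
A small sketch of why this happens: `duplicated()` on a data frame compares rows position by position, so (A1, G2) and (G2, A1) count as different rows. The `pairs` data frame and the `key` vector below are made up for illustration. Building an order-independent key with `pmin()` and `pmax()` is one possible way to make the comparison ignore the column order without moving any values between columns.

```r
# Two rows holding the same pair in opposite column order (illustrative data)
pairs <- data.frame(feedID  = c("A1", "G2"),
                    feedID2 = c("G2", "A1"),
                    stringsAsFactors = FALSE)

# Positional comparison: the rows differ column by column, so nothing is flagged
dup_positional <- duplicated(pairs)

# Order-independent key: smaller value first, larger value second
key <- paste(pmin(pairs$feedID, pairs$feedID2),
             pmax(pairs$feedID, pairs$feedID2))
dup_unordered <- duplicated(key)

print(dup_positional)
print(dup_unordered)
```

The original columns are never reordered here; only the temporary key is.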

This code does nothing at all, but does not throw an error:

  test2 <-  test%>% distinct(1,2,3) # where numbers refer to the columns

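This code produces an error about dimnames on my real data, and I am not sure what that means. It does not happen with my test data, I don't know why, and I can't reproduce the error...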

  indx <- !duplicated(t(apply(test, 1, sort))) # finds non - duplicates in sorted rows
   test[indx, ] 
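
One difference between the test data and real data that may matter here (an assumption, since the error could not be reproduced): the test data stores the string "NA", while real data probably contains true missing values. By default `sort()` drops `NA`, so sorted rows can end up with different lengths, and `apply()` then no longer returns a regular matrix. The vector `row_with_na` below is a made-up example row.

```r
# One row as apply() would see it, with a real missing value (illustrative)
row_with_na <- c(ID = "49V", feedID = "G2", feedID2 = NA)

# Default sort() removes NA, so the result is shorter than the input
sorted_default <- sort(row_with_na)

# na.last = TRUE keeps the NA and therefore the original length
sorted_keep <- sort(row_with_na, na.last = TRUE)

print(length(sorted_default))
print(length(sorted_keep))
```

Sorting whole rows also mixes ID into the comparison, so restricting the sort to the two feed columns is probably closer to what is wanted.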

Any ideas?

3 answers:

Answer 0: (score: 1)

Your data, but with "NA" changed to NA and stringsAsFactors=F:

test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
                   feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1",  "D2" ),
                   feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2",  NA ),
                   stringsAsFactors=F)
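
The change from "NA" to NA matters because the string "NA" is ordinary text and is not treated as missing by `is.na()` or `complete.cases()`. A minimal sketch, using a made-up vector named `feedID2` for illustration, of how such strings could be converted in data that was already read in:

```r
# Illustrative vector with the string "NA" instead of a missing value
feedID2 <- c("A1", "G2", "NA")

# The string is not recognised as missing
string_is_missing <- is.na(feedID2)

# Replace the string with a real missing value
feedID2[feedID2 == "NA"] <- NA
real_is_missing <- is.na(feedID2)

print(string_is_missing)
print(real_is_missing)
```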

 library(dplyr)
 test %>% 
  filter(complete.cases(.)) %>%             # Remove rows with NA
  rowwise() %>%                             # Perform next step by row
  mutate(dup=paste0(sort(c(feedID,feedID2)),collapse="")) %>%   # Sort and combine feedID and feedID2
  ungroup() %>%                             # Remove rowwise grouping
  group_by(ID) %>%                          # Look for duplicates within each ID
  mutate(dup=duplicated(dup)) %>%           # Find duplicated feedID:feedID2 pairs
  filter(dup==F) %>%                        # Remove duplicated pairs
  filter(!(feedID==feedID2)) %>%            # Remove where feedID == feedID2
  select(-dup)                              # Remove dummy column


     ID feedID feedID2
1   49V     A1      G2
2   52V     B1      D1
3   52V     D1      D2

If you only want to look for NA in feedID & feedID2, replace filter(complete.cases(.)) with filter(!is.na(feedID) & !is.na(feedID2))

Answer 1: (score: 0)

Here is a base R solution, using the complete.cases function and creating a sorted feedID column:

# remove any rows with NA values
test <- test[complete.cases(test[,c('ID', 'feedID','feedID2')]),]
#remove any rows with feedID == feedID2
test <- test[!(test$feedID == test$feedID2),]
# add new feedID3 column
test$feedID3 <- apply(test, 1, function(x) paste(sort(c(x[2], x[3])), collapse = '-'))
# remove any duplicates, and remove last column
test[!duplicated(test[,c('feedID3', 'ID')]), -4]


   ID feedID feedID2
2 49V     A1      G2
6 52V     B1      D1
7 52V     D1      D2
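
As a variation on the same idea, the steps can be wrapped in a function that refers to columns by name instead of by position, which may be more robust when the data set has many other columns. This is only a sketch: `drop_unordered_duplicates` is a hypothetical helper, not part of any package, and it swaps the `apply()`/`sort()` step for a `pmin()`/`pmax()` key.

```r
# Hypothetical helper: drop NA rows, equal pairs and order-independent
# duplicates of columns a and b within each value of the group column
drop_unordered_duplicates <- function(df, group, a, b) {
  # Keep rows where both columns are present and differ from each other
  keep <- !is.na(df[[a]]) & !is.na(df[[b]]) & df[[a]] != df[[b]]
  df <- df[keep, , drop = FALSE]
  # Key that ignores which column a value sits in, built per group
  key <- paste(df[[group]],
               pmin(df[[a]], df[[b]]),
               pmax(df[[a]], df[[b]]),
               sep = "|")
  # Keep the first occurrence of each key; all columns are retained
  df[!duplicated(key), , drop = FALSE]
}

test <- data.frame(ID = c("49V", "49V", "49V", "49V", "49V", "52V", "52V", "52V"),
                   feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2"),
                   feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2", NA),
                   stringsAsFactors = FALSE)

result <- drop_unordered_duplicates(test, "ID", "feedID", "feedID2")
print(result)
```

The separator "|" is assumed not to occur in the ID or feed values; a different one would be needed otherwise.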

Data

Note that we have converted "NA" to NA, and we have also set stringsAsFactors = FALSE

test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
                   feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1",  "D2" ),
                   feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2",  NA ),
                   stringsAsFactors = FALSE)

Answer 2: (score: 0)

Changed "NA" to NA and set stringsAsFactors = F

library(dplyr)
library(stringr)

test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
                   feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1",  "D2" ),
                   feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2",  NA ),
                   stringsAsFactors = F)

desiredoutput <- data.frame(ID= c("49V", "52V", "52V"),
                            feedID = c("A1","B1", "D1" ),
                            feedID2 = c("G2", "D1", "D2" ),
                            stringsAsFactors = F)

test %>% 
  # Remove NAs and all rows where the IDs are equal
  filter(!is.na(feedID),                       
         !is.na(feedID2),                      
         feedID != feedID2) %>%                
  # Group rowwise and create a sorted pair of the two ID columns
  rowwise() %>%                                
  mutate(revCheck = str_c(str_sort(c(feedID, feedID2)), collapse = "")) %>% 
  ungroup() %>% 
  # Find distinct ID pairs and keep all variables
  distinct(revCheck,
           .keep_all = T) %>% 
  # Find distinct rows for each ID pair. I kept these separate because I
  # think that's what you're asking for in your example, you want all
  # duplicates in feedID and all duplicates in feedID2 removed, not just
  # duplicate combinations of feedID and feedID2. See .keep_all in ?distinct
  distinct(feedID,
           .keep_all = T) %>% 
  distinct(feedID2,
           .keep_all = T) %>% 
  # Remove the sorted pair id
  select(-revCheck) %>% 
  # Return a dataframe
  as.data.frame(.)
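
Two properties of this pipeline are worth checking against your own data. The pair key is compared across the whole data frame rather than within each ID, and the extra `distinct(feedID, ...)` and `distinct(feedID2, ...)` steps remove rows whose pair is new but whose single value has already appeared. The sketch below mirrors both effects in base R with `duplicated()`, assuming that `distinct(..., .keep_all = T)` keeps the first row for each value. The `pairs` data frame is made up for illustration.

```r
# Illustrative data: the pair (A1, B1) appears in two different IDs,
# and feedID "A1" appears twice within ID 49V with different partners
pairs <- data.frame(ID      = c("49V", "49V", "52V"),
                    feedID  = c("A1", "A1", "B1"),
                    feedID2 = c("G2", "B1", "A1"),
                    stringsAsFactors = FALSE)

lo <- pmin(pairs$feedID, pairs$feedID2)
hi <- pmax(pairs$feedID, pairs$feedID2)

# Key without ID: row 3 is flagged although it belongs to another ID
dup_global <- duplicated(paste(lo, hi))

# Key with ID: no row is flagged
dup_by_id <- duplicated(paste(pairs$ID, lo, hi))

# Duplicates in feedID alone: row 2 is flagged although its pair is new
dup_feedID <- duplicated(pairs$feedID)

print(dup_global)
print(dup_by_id)
print(dup_feedID)
```

If duplicates should only be removed within each ID, including ID in the key (or grouping by ID first) is probably the safer choice.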