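I need to remove duplicate combinations of two columns (feedID and feedID2) within groups (ID), while keeping a lot of other columns in the data set. All rows with a duplicated combination should be removed, regardless of whether A is in column 2 and B in column 3 or vice versa. In addition, I want to remove all rows where the same value (e.g. A) appears in both columns, and rows where one of the columns contains NA. I can't sort the data between columns, i.e. if A is in column 2 it should stay in column 2.
I know this may be a duplicate question, but none of the other answers seem to work for my data set, or they ask for something slightly different. E.g. "Finding unique combinations irrespective of position" and "Removing duplicate combinations in R (irrespective of order)".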
test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2" ),
feedID2 = c("A1", "G2", "A1", "G2", "NA", "D1", "D2", "NA" ))
desiredoutput <- data.frame(ID= c("49V", "52V", "52V"),
feedID = c("A1","B1", "D1" ),
feedID2 = c("G2", "D1", "D2" ))
The following code does not remove duplicates when the values are in different columns:
test2 <- test[!duplicated(test[, c("ID", "feedID", "feedID2")]), ]
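The reason is that duplicated() compares rows position by position, so (A1, G2) and (G2, A1) count as different rows. A minimal base R sketch of an order-independent comparison follows. It assumes the missing values are real NA rather than the string "NA", and the names lo, hi, dup_positional and dup_unordered are illustrative only, not from the original post.

```r
test <- data.frame(
  ID      = c("49V", "49V", "49V", "49V", "49V", "52V", "52V", "52V"),
  feedID  = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2"),
  feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2", NA),
  stringsAsFactors = FALSE
)

# Position-wise comparison: row 3 (G2, A1) is not seen as a repeat of row 2 (A1, G2)
dup_positional <- duplicated(test[, c("ID", "feedID", "feedID2")])

# Order-independent key: smaller value first, larger value second.
# The original columns are not modified, so A1 stays in the column it came from.
lo <- pmin(test$feedID, test$feedID2)
hi <- pmax(test$feedID, test$feedID2)
dup_unordered <- duplicated(data.frame(test$ID, lo, hi))

which(dup_positional)  # only the exact repeat is flagged
which(dup_unordered)   # the swapped pair is flagged as well
```

Because the key lives in separate vectors, filtering with test[!dup_unordered, ] keeps every other column as it is.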
This code does nothing at all, but doesn't throw an error either:
test2 <- test%>% distinct(1,2,3) # where numbers refer to the columns
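This code gives an error on my real data, something about dimnames, and I'm not sure what that means. It does not happen with my test data, I don't know why, and I can't reproduce the error...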
indx <- !duplicated(t(apply(test, 1, sort))) # finds non - duplicates in sorted rows
test[indx, ]
Any ideas?
Answer 0 (score: 1)
Your data, but with "NA" changed to NA and stringsAsFactors = FALSE:
test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2" ),
feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2", NA ),
stringsAsFactors=F)
library(dplyr)
test %>%
  filter(complete.cases(.)) %>% # Remove rows with NA
  rowwise() %>% # Perform next step by row
  mutate(dup = paste0(sort(c(feedID, feedID2)), collapse = "")) %>% # Sort and combine feedID and feedID2
  ungroup() %>% # Remove rowwise grouping
  group_by(ID) %>% # Look for duplicates within each ID
  mutate(dup = duplicated(dup)) %>% # Flag duplicated feedID:feedID2 pairs
  filter(!dup) %>% # Remove duplicated pairs
  filter(feedID != feedID2) %>% # Remove rows where feedID == feedID2
  select(-dup) # Remove dummy column
ID feedID feedID2
1 49V A1 G2
2 52V B1 D1
3 52V D1 D2
If you only want to look for NA in feedID and feedID2, replace filter(complete.cases(.)) with filter(!is.na(feedID) & !is.na(feedID2)).
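The difference matters when other columns contain NA of their own. A small base R sketch of the two behaviours, using a made-up data frame df with a hypothetical extra column note:

```r
df <- data.frame(
  feedID  = c("A1", "G2", "B1"),
  feedID2 = c("G2", NA, "D1"),
  note    = c(NA, "x", "y"),  # hypothetical extra column with its own NA
  stringsAsFactors = FALSE
)

# An NA in any column drops the row
all_cols <- df[complete.cases(df), ]

# Only an NA in feedID or feedID2 drops the row
pair_cols <- df[complete.cases(df[, c("feedID", "feedID2")]), ]

rownames(all_cols)
rownames(pair_cols)
```

Here the first row survives only in pair_cols, because its single NA sits in note rather than in one of the two pair columns.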
Answer 1 (score: 0)
Here is a base R solution that uses the complete.cases function and creates a sorted feedID column:
# remove any rows with NA values
test <- test[complete.cases(test[,c('ID', 'feedID','feedID2')]),]
#remove any rows with feedID == feedID2
test <- test[!(test$feedID == test$feedID2),]
# add new feedID3 column
test$feedID3 <- apply(test, 1, function(x) paste(sort(c(x[2], x[3])), collapse = '-'))
# remove any duplicates, and remove last column
test[!duplicated(test[,c('feedID3', 'ID')]), -4]
ID feedID feedID2
2 49V A1 G2
6 52V B1 D1
7 52V D1 D2
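The code above addresses columns by position (x[2], x[3] and -4), which would pick the wrong columns in a data set with many other columns, as described in the question. A sketch of the same steps that refers to columns by name is given below. The function name drop_pair_duplicates and the extra column other are hypothetical, introduced only for illustration, and the pair columns are assumed to be character vectors.

```r
# Hypothetical helper: keeps the first row of each unordered (a, b) pair within each group
drop_pair_duplicates <- function(df, group, a, b) {
  # Drop rows where either pair column is NA, or where both hold the same value
  keep <- !is.na(df[[a]]) & !is.na(df[[b]]) & df[[a]] != df[[b]]
  df <- df[keep, , drop = FALSE]
  # Order-independent key, held outside the data frame so no column is added or reordered
  lo <- pmin(df[[a]], df[[b]])
  hi <- pmax(df[[a]], df[[b]])
  df[!duplicated(data.frame(df[[group]], lo, hi)), , drop = FALSE]
}

test <- data.frame(
  ID      = c("49V", "49V", "49V", "49V", "49V", "52V", "52V", "52V"),
  feedID  = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2"),
  feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2", NA),
  other   = 1:8,  # made-up extra column standing in for the rest of the data set
  stringsAsFactors = FALSE
)

result <- drop_pair_duplicates(test, "ID", "feedID", "feedID2")
result
```

Since no helper column is ever attached to the data frame, there is nothing to drop at the end, and the extra columns pass through unchanged.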
Note that we have converted "NA" to NA, and we have also set stringsAsFactors = FALSE:
test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2" ),
feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2", NA ),
stringsAsFactors = FALSE)
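The conversion matters because the string "NA" is an ordinary character value, so is.na() and complete.cases() do not treat it as missing. If the data has already been loaded with "NA" strings, one way to convert them in place is sketched below. The data frame raw is made up for illustration, and the approach assumes every column is character and contains no real NA yet.

```r
raw <- data.frame(
  feedID  = c("A1", "G2"),
  feedID2 = c("G2", "NA"),  # the string "NA", not a missing value
  stringsAsFactors = FALSE
)

before <- sum(is.na(raw))  # the string is not counted as missing

# Replace every cell equal to the string "NA" with a real missing value
raw[raw == "NA"] <- NA

after <- sum(is.na(raw))   # now the cell is counted as missing
```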
Answer 2 (score: 0)
Change "NA" to NA and set stringsAsFactors = F:
library(dplyr)
library(stringr)
test <- data.frame(ID= c("49V", "49V","49V", "49V", "49V", "52V", "52V", "52V"),
feedID = c("A1", "A1", "G2", "A1", "G2", "B1", "D1", "D2" ),
feedID2 = c("A1", "G2", "A1", "G2", NA, "D1", "D2", NA ),
stringsAsFactors = F)
desiredoutput <- data.frame(ID= c("49V", "52V", "52V"),
feedID = c("A1","B1", "D1" ),
feedID2 = c("G2", "D1", "D2" ),
stringsAsFactors = F)
test %>%
# Remove NAs and all rows where the IDs are equal
filter(!is.na(feedID),
!is.na(feedID2),
feedID != feedID2) %>%
# Group rowwise and create a sorted pair of the two ID columns
rowwise() %>%
mutate(revCheck = str_c(str_sort(c(feedID, feedID2)), collapse = "")) %>%
ungroup() %>%
# Find distinct ID pairs and keep all variables
distinct(revCheck,
.keep_all = T) %>%
# Find distinct rows for each ID pair. I kept these separate because I
# think that's what you're asking for in your example, you want all
# duplicates in feedID and all duplicates in feedID2 removed, not just
# duplicate combinations of feedID and feedID2. See .keep_all in ?distinct
distinct(feedID,
.keep_all = T) %>%
distinct(feedID2,
.keep_all = T) %>%
# Remove the sorted pair id
select(-revCheck) %>%
# Return a dataframe
as.data.frame(.)
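One thing to be aware of: the pipeline calls distinct(revCheck, ...) after ungroup(), so pairs are compared across the whole data frame rather than within each ID. If the same pair may legitimately occur under different IDs, the key needs to include ID (in dplyr, for instance, distinct(ID, revCheck, .keep_all = TRUE)). A base R sketch of the difference, using a made-up two-row data frame pairs:

```r
pairs <- data.frame(
  ID      = c("49V", "52V"),
  feedID  = c("A1", "G2"),
  feedID2 = c("G2", "A1"),
  stringsAsFactors = FALSE
)

# Sorted pair key, analogous to revCheck in the answer
key <- paste(pmin(pairs$feedID, pairs$feedID2),
             pmax(pairs$feedID, pairs$feedID2), sep = "-")

dup_global <- duplicated(key)                        # compares across all IDs
dup_within <- duplicated(data.frame(pairs$ID, key))  # compares within each ID only
```

With the global key the second row is treated as a duplicate of the first even though it belongs to a different ID. With the ID-aware key both rows are kept.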