假设我要为列查找重复的行:
cols<-c("col1", "col2")
我知道f4的重复行是:
Jo<-df4[duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE), ]
并从数据集中删除这些重复的行:
No<-df4[!(duplicated(df4[cols]) | duplicated(df4[cols], fromLast = TRUE)), ]
我想修改以上代码。假设有一列称为模式。它需要1到4之间的整数。我不希望所有重复的行都具有相同的mode == 2。
示例
col1 col2 mode
1 3 5
5 3 9
1 2 1
1 2 1
3 2 2
3 2 2
4 1 3
4 1 2
4 1 2
输出
Jo:
col1 col2 mode
1 2 1
1 2 1
4 1 3
4 1 2
4 1 2
No:
col1 col2 mode
1 3 5
5 3 9
3 2 2
3 2 2
在上面的示例中,在第3和第4行中,因为mode == 2都不是重复的,但是在最后三行中,因为其中一个不是2,所以是重复的
答案 0 :(得分:1)
基于更新后的数据集,
library(dplyr)
out1 <- df2 %>%
group_by_at(vars(cols)) %>%
filter(n() > 1, !all(mode ==2))
out2 <- anti_join(df2, out1)
out1
# A tibble: 5 x 3
# Groups: col1, col2 [2]
# col1 col2 mode
# <int> <int> <int>
#1 1 2 1
#2 1 2 1
#3 4 1 3
#4 4 1 2
#5 4 1 2
out2
# col1 col2 mode
#1 1 3 5
#2 5 3 9
#3 3 2 2
#4 3 2 2
或与data.table
library(data.table)
i1 <- setDT(df2)[ , .I[.N > 1 & !all(mode == 2)], by = cols]$V1
df2[i1]
# col1 col2 mode
#1: 1 2 1
#2: 1 2 1
#3: 4 1 3
#4: 4 1 2
#5: 4 1 2
df2[!i1]
# col1 col2 mode
#1: 1 3 5
#2: 5 3 9
#3: 3 2 2
#4: 3 2 2
或使用base R
i1 <- duplicated(df2[1:2])|duplicated(df2[1:2], fromLast = TRUE)
out11 <- df2[i1 & with(df2, !ave(mode==2, col1, col2, FUN = all)),]
out22 <- df2[setdiff(row.names(df2), row.names(out11)),]
df2 <- structure(list(col1 = c(1L, 5L, 1L, 1L, 3L, 3L, 4L, 4L, 4L),
col2 = c(3L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), mode = c(5L,
9L, 1L, 1L, 2L, 2L, 3L, 2L, 2L)), class = "data.frame", row.names = c(NA,
-9L))