Question

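When removing rows that are duplicated in two specific columns, is it possible to preferentially keep one of the duplicate rows based on a third column?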

Consider the following example:

# Example dataframe.
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3),
                 col.2 = c(1, 1, 1, 2, 2, 2, 2),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c'))
# Output
col.1 col.2 col.3
    1     1     b
    1     1     c
    1     1     a
    2     2     b
    2     2     a
    2     2     b
    3     2     c

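I want to remove rows that are duplicated in both col.1 and col.2, while always keeping the duplicate row that has col.3 == 'a'; otherwise I have no preference for which duplicate row is kept. For this example, the resulting data frame would look like this: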

# Output.
col.1 col.2 col.3
    1     1     a
    2     2     a
    3     2     c

Thanks for any help!

Answer 1

We can first order by col.3 and then remove the duplicates, i.e.

d1 <- df[with(df, order(col.3)),]
d1[!duplicated(d1[c(1, 2)]),]
#  col.1 col.2 col.3
#3     1     1     a
#5     2     2     a
#7     3     2     c
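
Ordering on col.3 works here because 'a' happens to sort first alphabetically. As a minimal sketch, assuming the preferred value might not sort first in other data, the same base R idea can order on the logical test col.3 != "a" instead (the names d1 and res are only illustrative):

```r
df <- data.frame(col.1 = c(1, 1, 1, 2, 2, 2, 3),
                 col.2 = c(1, 1, 1, 2, 2, 2, 2),
                 col.3 = c('b', 'c', 'a', 'b', 'a', 'b', 'c'))

# FALSE sorts before TRUE, so rows with col.3 == "a" move to the top.
# Ties keep their original relative order.
d1 <- df[order(df$col.3 != "a"), ]

# duplicated() flags every occurrence after the first, so the first
# (preferred) row of each col.1/col.2 pair is the one that survives.
res <- d1[!duplicated(d1[c("col.1", "col.2")]), ]
res
```

For groups without an 'a' row, this keeps whichever row appeared first in the original data.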

Answer 2

Since you want to keep a, one option is to arrange and take the first row in each group.

library(dplyr)

df %>%
  arrange_all() %>%
  group_by(col.1, col.2) %>%
  slice(1)

#  col.1 col.2 col.3
#  <dbl> <dbl> <fct>
#1     1     1 a    
#2     2     2 a    
#3     3     2 c

If the values of col.3 do not sort into the desired priority order on their own, you can specify the order manually with match:

df %>%
  arrange(col.1, col.2, match(col.3, c("a", "b", "c"))) %>%
  group_by(col.1, col.2) %>%
  slice(1)

Answer 3

With dplyr, you can also do:

df %>%
 group_by(col.1, col.2) %>%
 filter(col.3 == min(col.3))

  col.1 col.2 col.3
  <dbl> <dbl> <chr>
1     1     1 a    
2     2     2 a    
3     3     2 c

Or:

df %>%
 group_by(col.1, col.2) %>%
 filter(dense_rank(col.3) == 1)

Or:

df %>%
 group_by(col.1, col.2) %>%
 slice(which.min(match(col.3, letters[1:26])))
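
The three variants agree on the example data, but they may differ when a group contains the minimum value more than once. The sketch below uses a made-up data frame (df3 is hypothetical, not from the question) to compare the filter and slice forms:

```r
library(dplyr)

# Hypothetical data: one group in which "a" appears twice.
df3 <- data.frame(col.1 = c(1, 1, 1),
                  col.2 = c(1, 1, 1),
                  col.3 = c("a", "a", "b"),
                  stringsAsFactors = FALSE)

# filter() keeps every row that equals the group minimum, so ties survive.
res_filter <- df3 %>%
  group_by(col.1, col.2) %>%
  filter(col.3 == min(col.3)) %>%
  ungroup()

# which.min() returns only the first position of the minimum,
# so slice() keeps exactly one row per group.
res_slice <- df3 %>%
  group_by(col.1, col.2) %>%
  slice(which.min(match(col.3, letters))) %>%
  ungroup()

nrow(res_filter)
nrow(res_slice)
```

If exactly one row per col.1/col.2 pair is required, the slice form is the safer choice for such data.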

Answer 4

One option is to group by 'col.1', 'col.2' and slice the row where 'col.3' is "a" if the number of rows is greater than 1, otherwise return the first row

library(dplyr)
df %>% 
   group_by(col.1, col.2) %>%
   slice(if(n() > 1) which(col.3 == 'a') else 1)
# A tibble: 3 x 3
# Groups:   col.1, col.2 [3]
#  col.1 col.2 col.3
#  <dbl> <dbl> <fct>
#1     1     1 a    
#2     2     2 a    
#3     3     2 c

Another option is to group by 'col.1', 'col.2' and then slice using the index obtained from match of 'a' against 'col.3'. If there is no match, index 1 is returned.

df %>% 
   group_by(col.1, col.2) %>% 
   slice(match("a", col.3, nomatch = 1))
# A tibble: 3 x 3
# Groups:   col.1, col.2 [3]
#  col.1 col.2 col.3
#  <dbl> <dbl> <fct>
#1     1     1 a    
#2     2     2 a    
#3     3     2 c
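
The two options give the same result on the example data, but they can behave differently when a duplicated group has no "a" row at all. A small sketch with made-up data (df2 is hypothetical, not from the question) illustrates the difference:

```r
library(dplyr)

# Hypothetical data: the (2, 2) group has duplicates but no "a" row.
df2 <- data.frame(col.1 = c(1, 1, 2, 2),
                  col.2 = c(1, 1, 2, 2),
                  col.3 = c("b", "a", "b", "c"),
                  stringsAsFactors = FALSE)

# which() returns integer(0) for the (2, 2) group,
# so slice() keeps no rows from that group.
res_which <- df2 %>%
  group_by(col.1, col.2) %>%
  slice(if (n() > 1) which(col.3 == "a") else 1) %>%
  ungroup()

# match() with nomatch = 1 falls back to the first row of the group.
res_match <- df2 %>%
  group_by(col.1, col.2) %>%
  slice(match("a", col.3, nomatch = 1)) %>%
  ungroup()

nrow(res_which)
nrow(res_match)
```

If every col.1/col.2 pair should appear in the result, the match form with nomatch = 1 is likely the better fit.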

Answer 5

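You can use dplyr::distinct, which has a .keep_all argument that lets you keep the entire first row for each distinct combination. First we need to sort so that "a" is at the top: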

library(dplyr)
df %>%
  arrange(col.1, col.2, col.3 != "a") %>%
  distinct(col.1, col.2, .keep_all = TRUE)
#>   col.1 col.2 col.3
#> 1     1     1     a
#> 2     2     2     a
#> 3     3     2     c

Preferentially removing partial duplicates in a data frame
