R: deduplicating records that are not exact duplicates

Time: 2018-04-08 18:34:43

Tags: r duplicates data-cleaning

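I have a list of records that I need to deduplicate. The records look like combinations drawn from the same group, but the usual deduplication functions do not work because the values in the two columns are not exact duplicates. Below is a reproducible example.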

df <- data.frame(A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
                 B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"))

Below is the desired output I am looking for.

df_Final <- data.frame(A = c("2","2","2","331","391","481"),
                       B = c("43","501","502","491","496","490"))

2 answers:

Answer 0: (score: 1)

I think the goal is to find where each element of column A first appears in column B:

idx = match(df$A, df$B)

Keep a row if its element of A is not in B (is.na(idx)), or if the row comes before the first occurrence of that element in B (seq_along(idx) < idx):

df[is.na(idx) | seq_along(idx) < idx,]
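
To make the rule concrete, here is a minimal base R sketch that traces idx and the keep mask on the question's example data. It assumes the columns are plain character vectors, so stringsAsFactors = FALSE is set explicitly (an addition that is not in the original post). The names keep and result are only illustrative.

```r
# Example data from the question; stringsAsFactors = FALSE is an assumption
# added here so that A and B are character vectors on any R version.
df <- data.frame(
  A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
  B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"),
  stringsAsFactors = FALSE
)

# Position in B where each value of A first appears (NA if it never appears).
idx <- match(df$A, df$B)

# Keep a row if its A value never occurs in B, or if the row comes
# before the first occurrence of that value in B.
keep <- is.na(idx) | seq_along(idx) < idx

# Show the intermediate values next to the data.
print(cbind(df, idx = idx, keep = keep))

result <- df[keep, ]
rownames(result) <- NULL  # renumber the rows; optional
print(result)
```

On this data, rows 1, 2, 3, 7, 8 and 9 are kept, which matches df_Final from the question. Because the test compares row positions, the result depends on the order of the rows in df.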

A more or less literal tidyverse version of this might create and then remove a temporary column:

library(tidyverse)
df %>% mutate(idx = match(A, B)) %>%
    filter(is.na(idx) | seq_along(idx) < idx) %>%
    select(-idx)

Answer 1: (score: 0)

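You can remove all rows that are duplicates of another row under reordering:

library(dplyr)
df %>%
    apply(1, sort) %>% t() %>%   # sort the two values within each row
    data.frame() %>%             # back to a data frame (columns X1, X2)
    group_by_all() %>%           # group identical sorted rows
    slice(1)                     # keep the first row of each group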
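
The same idea can be sketched in base R with duplicated(). This is an illustrative variant, not the answerer's code: it keeps the original rows and column names rather than returning the sorted values, and it again assumes character columns (stringsAsFactors = FALSE is added here). The names sorted_pairs and result are hypothetical.

```r
# Example data from the question; stringsAsFactors = FALSE is an assumption.
df <- data.frame(
  A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
  B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"),
  stringsAsFactors = FALSE
)

# Sort the two values within each row so that (a, b) and (b, a) become identical.
# apply() returns one column per input row, hence the transpose.
sorted_pairs <- t(apply(df, 1, sort))

# Keep only the first row of each unordered pair.
result <- df[!duplicated(sorted_pairs), ]
print(result)
print(nrow(result))
```

On the example data this keeps 9 rows, not the 6 rows of df_Final. Pairs such as 43/501, 43/502 and 501/502 are not reorderings of any earlier row, so they survive. Removing reordered duplicates alone therefore does not reproduce the desired output for this input.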