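Removing duplicates within each group using R

Time: 2019-07-19 13:08:23

Tags: r group-by duplicates

I have a dataset containing employee IDs, names, and bank account information. Some of these employees have duplicate names, either with the same employee ID or with different employee IDs. A few of them have the same bank account under the same name, while others have different bank account numbers under the same name. The goal is to find the employees who have the same name but different bank account numbers. Here is a sample of the data: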

| Emp_id |   Name  | Bank Account |
|--------|:-------:|-------------:|
| 123    |   Joan  |         6758 |
| 134    |  Karyn  |         1244 |
| 143    | Larry   | 4900         |
| 143    | Larry   | 5201         |
| 235    | Larry   | 5201         |
| 433    | Larry   | 5201         |
| 231    | Larry   | 5201         |
| 120    | Amy     | 7890         |
| 135    | Amy     | 7890         |
| 150    |  Chris  | 1280         |
| 150    | Chris   | 6565         |
| 900    | Cassy   | 1280         |
| 900    | Cassy   | 9873         |

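I first have to find the employees whose names are duplicated, which I am able to do successfully. After that, I have to identify the employees with the same name but different bank account numbers. The problem is that my code neither groups the employees by name nor looks for differing bank accounts within a name. Instead, it compares account numbers across different individuals and, when two are the same, drops one of the duplicated values. For example, Chris and Cassy share the bank account number "1280", so the two rows are treated as duplicates and one of Chris's records (the one with account 1280) is removed automatically. The output I get looks like this: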

| Emp_id |  Name | Bank Account |
|--------|:-----:|-------------:|
| 120    |  Amy  |         7890 |
| 900    | Cassy |         1280 |
| 900    | Cassy | 9873         |
| 150    | Chris | 6565         |
| 143    | Larry | 4900         |
| 143    | Larry | 5201         |

Here is the code I used:

library(dplyr)

sample <- data.frame(
  Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
  Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry","Amy","Amy","Chris","Chris","Cassy","Cassy"),
  Bank_Account = c("6758","1244","4900","5201","5201","5201","5201","7890","7890","1280","6565","1280","9873")
)
# Keep only the rows whose name occurs more than once
n_occur <- data.frame(table(sample$Name))
n_occur <- n_occur[n_occur$Freq > 1, ]
Duplicates <- sample[sample$Name %in% n_occur$Var1, ]
Duplicates <- Duplicates %>% arrange(Name)
# duplicated() is applied to the whole column here, not within each name
Duplicates <- Duplicates[!duplicated(Duplicates$Bank_Account), ]
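
The effect of that last line can be reproduced in isolation. The following is only a minimal sketch: the vectors employee and account are made up for illustration from the Cassy and Chris rows, and are not part of the code above. Calling duplicated() on the account column alone flags Chris's 1280 as a repeat of Cassy's, while checking name and account together flags nothing.

```r
# Hypothetical vectors built from the Cassy and Chris rows of the sample data
employee <- c("Cassy", "Cassy", "Chris", "Chris")
account  <- c("1280", "9873", "1280", "6565")

# Whole-column check: the third value repeats the first, whoever owns it
dup_global <- duplicated(account)

# Name and account checked together: no (name, account) pair repeats
dup_within <- duplicated(data.frame(employee, account))

print(dup_global)  # FALSE FALSE  TRUE FALSE
print(dup_within)  # FALSE FALSE FALSE FALSE
```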

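However, the output should consider the bank account numbers within each name (that is, among rows with the same name). It should look like this: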

| Emp_id |   Name  | Bank Account |
|--------|:-------:|-------------:|
| 900    |  Cassy  |1280          |
| 900    |  Cassy  |9873          |
| 150    |  Chris  | 1280         |
| 150    | Chris   | 6565         |
| 143    | Larry   | 4900         |
| 143    | Larry   | 5201         |

Can someone point me to the correct code?

2 Answers:

Answer 0 (score: 1)

We can use n_distinct with filter:

library(dplyr)
sample %>% 
    group_by(Name) %>%
    filter(n() > 1) %>%
    group_by(Id, .add = TRUE) %>%  # dplyr >= 1.0.0; older versions used add = TRUE
    filter(n_distinct(Bank_Account) > 1) %>%
    arrange(desc(Id))
# A tibble: 6 x 3
# Groups:   Name, Id [3]
#  Id    Name  Bank_Account
#  <fct> <fct> <fct>       
#1 900   Cassy 1280        
#2 900   Cassy 9873        
#3 150   Chris 1280        
#4 150   Chris 6565        
#5 143   Larry 4900        
#6 143   Larry 5201      
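
For comparison, roughly the same filter can be written without dplyr. This is only a sketch in base R and not part of the answer above: the data frame is rebuilt under the assumed name emp (to avoid masking base::sample), and the helper names key, n_accounts and keep are illustrative.

```r
# Sample data from the question, rebuilt under the assumed name `emp`
emp <- data.frame(
  Id = c("123", "134", "143", "143", "235", "433", "231",
         "120", "135", "150", "150", "900", "900"),
  Name = c("Joan", "Karyn", "Larry", "Larry", "Larry", "Larry", "Larry",
           "Amy", "Amy", "Chris", "Chris", "Cassy", "Cassy"),
  Bank_Account = c("6758", "1244", "4900", "5201", "5201", "5201", "5201",
                   "7890", "7890", "1280", "6565", "1280", "9873"),
  stringsAsFactors = FALSE
)

# One key per (Name, Id) pair, mirroring the two group_by() calls
key <- paste(emp$Name, emp$Id, sep = "|")

# Number of distinct accounts in each (Name, Id) group
n_accounts <- tapply(emp$Bank_Account, key, function(x) length(unique(x)))

# Keep every row whose group holds more than one distinct account
keep <- as.vector(n_accounts[key] > 1)
result <- emp[keep, ]
print(result)
```

On the sample data this keeps the six rows for Larry (143), Chris (150) and Cassy (900), in their original row order rather than sorted by Id.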

Answer 1 (score: 0)

Step 1 - identify the duplicated names:

step_1 <- sample %>%
  arrange(Name) %>%
  mutate(dup = duplicated(Name)) %>%
  filter(Name %in% unique(as.character(Name[dup])))

Step 2 - identify the duplicated accounts for those names:

step_2 <- step_1 %>%
  group_by(Name, Bank_Account) %>%
  mutate(dup = duplicated(Bank_Account)) %>%
  filter(!dup)
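
As written, step_2 keeps one row per combination of Name and Bank_Account, so a name whose rows all share a single account (Amy in the sample data) still shows up once. One possible extra step is to keep only the names that still have more than one row; with dplyr that could be step_2 %>% group_by(Name) %>% filter(n() > 1). The sketch below does the same thing in base R so that it runs on its own. The names emp, one_per_account, rows_left, multi_account_names and step_3 are assumptions for illustration, not part of the answer.

```r
# Sample data from the question, rebuilt under the assumed name `emp`
emp <- data.frame(
  Id = c("123", "134", "143", "143", "235", "433", "231",
         "120", "135", "150", "150", "900", "900"),
  Name = c("Joan", "Karyn", "Larry", "Larry", "Larry", "Larry", "Larry",
           "Amy", "Amy", "Chris", "Chris", "Cassy", "Cassy"),
  Bank_Account = c("6758", "1244", "4900", "5201", "5201", "5201", "5201",
                   "7890", "7890", "1280", "6565", "1280", "9873"),
  stringsAsFactors = FALSE
)

# Stand-in for step_2: first row of each (Name, Bank_Account) pair.
# Names that occur only once are still present here; the next step drops them.
one_per_account <- emp[!duplicated(emp[, c("Name", "Bank_Account")]), ]

# Assumed step 3: a name with more than one row left has several accounts
rows_left <- table(one_per_account$Name)
multi_account_names <- names(rows_left)[rows_left > 1]
step_3 <- one_per_account[one_per_account$Name %in% multi_account_names, ]
print(step_3)
```

Unlike Answer 0, this groups by Name only, so it would also catch a name whose different accounts sit under different Ids. On the sample data both approaches return the same six rows.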