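I have a dataset containing employee IDs, names, and bank account information. Some of these employees have duplicate names, either with the same employee ID or with different employee IDs. A few of them have the same bank account under the same name, while others have different bank account numbers under the same name. The goal is to find the employees who have the same name but different bank account numbers. Here is a sample of the data: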
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 123 | Joan | 6758 |
| 134 | Karyn | 1244 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
| 235 | Larry | 5201 |
| 433 | Larry | 5201 |
| 231 | Larry | 5201 |
| 120 | Amy | 7890 |
| 135 | Amy | 7890 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
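I first have to find the employees whose names are duplicated, which I have managed to do. After that, I have to identify the employees with the same name but different bank account numbers. The problem is that the code neither groups the employees by name nor looks for different bank accounts within a name. Instead, it compares account numbers across different individuals and, when it finds a match, drops one of the duplicate values. For example, Chris and Cassy share the bank account number "1280", so the two rows are treated as duplicates and one of Chris's records (the one with account 1280) is automatically removed. The output I get is shown below: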
| Emp_id | Name | Bank Account |
|--------|:-----:|-------------:|
| 120 | Amy | 7890 |
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
Here is the code I used:
library(dplyr)  # needed for %>% and arrange()

sample <- data.frame(
  Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
  Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry","Amy","Amy","Chris","Chris","Cassy","Cassy"),
  Bank_Account = c("6758","1244","4900","5201","5201","5201","5201","7890","7890","1280","6565","1280","9873"))
# count how often each name occurs and keep the names that occur more than once
n_occur <- data.frame(table(sample$Name))
n_occur <- n_occur[n_occur$Freq > 1, ]
Duplicates <- sample[sample$Name %in% n_occur$Var1, ]
Duplicates <- Duplicates %>% arrange(Name)
# drop rows whose bank account has already appeared
Duplicates <- Duplicates[!duplicated(Duplicates$Bank_Account), ]
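The last line is where the rows go missing: duplicated() on Bank_Account alone compares account numbers across the whole data frame, not within each name. A minimal base R sketch of the difference, rebuilding the sample data so it runs on its own (by_name, global_dup and per_name_dup are illustrative names, not part of the code above):

```r
sample <- data.frame(
  Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
  Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry",
           "Amy","Amy","Chris","Chris","Cassy","Cassy"),
  Bank_Account = c("6758","1244","4900","5201","5201","5201","5201",
                   "7890","7890","1280","6565","1280","9873"),
  stringsAsFactors = FALSE
)

# Sort by name, as the code above does with arrange()
by_name <- sample[order(sample$Name), ]

# Global check: TRUE if the account appeared earlier under ANY name
global_dup <- duplicated(by_name$Bank_Account)

# Per-name check: TRUE only if the same Name + account pair appeared earlier
per_name_dup <- duplicated(by_name[, c("Name", "Bank_Account")])

# Rows that the global check removes but the per-name check would keep
lost_rows <- by_name[global_dup & !per_name_dup, ]
print(lost_rows)
```

With this data the only row in lost_rows is Chris's record with account 1280, which is the record described as missing.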
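However, the output should compare bank account numbers within each name (that is, among rows with the same name). It should look like this: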
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
Can someone guide me to the correct code?
Answer 0 (score: 1)
We can use n_distinct inside filter:
library(dplyr)
sample %>%
group_by(Name) %>%
filter(n() > 1) %>%
group_by(Id, .add = TRUE) %>%  # dplyr >= 1.0.0; older versions spell this add = TRUE
filter(n_distinct(Bank_Account) > 1) %>%
arrange(desc(Id))
# A tibble: 6 x 3
# Groups: Name, Id [3]
# Id Name Bank_Account
# <fct> <fct> <fct>
#1 900 Cassy 1280
#2 900 Cassy 9873
#3 150 Chris 1280
#4 150 Chris 6565
#5 143 Larry 4900
#6 143 Larry 5201
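To see what the n_distinct condition is testing, it can help to tabulate the number of distinct accounts for each Name and Id pair. The following is a rough base R sketch of that count, not the answer's own method; length(unique(x)) stands in for n_distinct, and the names counts, n_accounts and multi are only illustrative:

```r
sample <- data.frame(
  Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
  Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry",
           "Amy","Amy","Chris","Chris","Cassy","Cassy"),
  Bank_Account = c("6758","1244","4900","5201","5201","5201","5201",
                   "7890","7890","1280","6565","1280","9873"),
  stringsAsFactors = FALSE
)

# Number of distinct accounts for every (Name, Id) combination
counts <- aggregate(Bank_Account ~ Name + Id, data = sample,
                    FUN = function(x) length(unique(x)))
names(counts)[3] <- "n_accounts"

# Combinations holding more than one distinct account
multi <- counts[counts$n_accounts > 1, ]
print(multi)
```

Only the combinations with a count above 1 (Ids 143, 150 and 900 in this data) pass the filter, which is why Amy and the other Larry rows drop out.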
Answer 1 (score: 0)
Step 1 - identify the duplicated names:
library(dplyr)

step_1 <- sample %>%
  arrange(Name) %>%
  mutate(dup = duplicated(Name)) %>%                # TRUE from the second occurrence of a name onward
  filter(Name %in% unique(as.character(Name[dup]))) # keep every row of a repeated name
Step 2 - identify the duplicated accounts for those names:
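step_2 <- step_1 %>%
  group_by(Name, Bank_Account) %>%
  mutate(dup = duplicated(Bank_Account)) %>%  # within a group, every row after the first is a repeat
  filter(!dup)                                # keep one row per Name + Bank_Account pair
# note: a name whose rows all share one account (e.g. Amy) still keeps a single row here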
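As written, the two steps leave one row for a name whose records all share a single account (Amy in this data), which the desired output does not include. One possible way to finish the job, sketched here in base R as an assumption rather than as part of the original answer, is a third step that keeps only names with more than one remaining row. The base R lines for steps 1 and 2 mirror the dplyr code above, and step_3 and n_acc are hypothetical names:

```r
sample <- data.frame(
  Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
  Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry",
           "Amy","Amy","Chris","Chris","Cassy","Cassy"),
  Bank_Account = c("6758","1244","4900","5201","5201","5201","5201",
                   "7890","7890","1280","6565","1280","9873"),
  stringsAsFactors = FALSE
)

# Step 1: keep every row whose name occurs more than once
dup_names <- unique(sample$Name[duplicated(sample$Name)])
step_1 <- sample[sample$Name %in% dup_names, ]

# Step 2: keep the first row of each (Name, Bank_Account) pair
step_2 <- step_1[!duplicated(step_1[, c("Name", "Bank_Account")]), ]

# Hypothetical step 3: keep names that still have more than one account
n_acc <- table(step_2$Name)
step_3 <- step_2[step_2$Name %in% names(n_acc)[n_acc > 1], ]
step_3 <- step_3[order(step_3$Name), ]
print(step_3)
```

For the sample data this leaves the six Cassy, Chris and Larry rows from the desired output.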