R - Duplicate rows when merging

Asked: 2017-05-19 14:44:32

Tags: r join merge

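I have a dataset that gives me each customer's email, their customer number, and whether they are an admin. We need the email of the customer's admin on the same record. This can be derived from the data: whenever a record's customer number equals the customer number of an admin record, put that admin's email in the row. In addition, a "second admin" should have their own email in the "admin email" field, not the email of the "first admin" for that customer.

I did this by subsetting the admins into a new data frame and then merging the admin dataset with the original dataset on customer number. The problem arises when a customer has 2 admins, because the join then produces duplicate records. Is there a way around this so that, when 2 admins are listed for 1 customer, the "first admin" email is used?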

##sample Data
    df <- data.frame(Email = c("test1@gmail.com", "test2@gmail.com", "test3@gmail.com","test4@gmail.com","test5@gmail.com","test6@gmail.com", "test7@gmail.com"),
                     Admin = c("Y", "N", "N","Y","N", "Y", "N"),
                     CustNum = c("1111","1111","1111","2222","2222","2222", "2222"))

##My solution
admins <- subset(df, df$Admin == "Y")
output <- merge(df, admins, by = "CustNum", all.x = TRUE)
colnames(output)[colnames(output)=="Email.y"] <- "Admin_Email"
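
A quick way to see the duplication is to compare row counts before and after the merge. The sketch below rebuilds the sample data (`paste0()` and `stringsAsFactors = FALSE` are conveniences added here, not part of the original code) and counts rows per customer.

```r
# Sample data equivalent to the question's; stringsAsFactors = FALSE is an
# added assumption so that Email stays a character column.
df <- data.frame(
  Email   = paste0("test", 1:7, "@gmail.com"),
  Admin   = c("Y", "N", "N", "Y", "N", "Y", "N"),
  CustNum = c("1111", "1111", "1111", "2222", "2222", "2222", "2222"),
  stringsAsFactors = FALSE
)

admins <- subset(df, Admin == "Y")
output <- merge(df, admins, by = "CustNum", all.x = TRUE)

nrow(df)               # 7 rows in the original data
nrow(output)           # 11 rows after the merge
table(output$CustNum)  # 1111: 3 rows, 2222: 8 rows
```

Customer 2222 has 4 records and 2 admins, so every one of its records is matched twice (4 x 2 = 8 rows), while customer 1111 with a single admin keeps its 3 rows.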


    expected <- data.frame(Email = c("test1@gmail.com", "test2@gmail.com", "test3@gmail.com","test4@gmail.com","test5@gmail.com","test6@gmail.com", "test7@gmail.com"),
                           Admin = c("Y", "N", "N","Y","N", "Y", "N"),
                           CustNum = c("1111","1111","1111","2222","2222","2222", "2222"),
                     Adminemail = c("test1@gmail.com","test1@gmail.com","test1@gmail.com","test4@gmail.com","test4@gmail.com","test6@gmail.com", "test4@gmail.com"))

2 answers:

Answer 0 (score: 1)

I think the simplest way is to use a for loop. There is probably a way to do it with data.table, but I couldn't figure it out...

Working solution, but not optimal

df$Adminemail <- NA

for (i in seq_len(nrow(df))) {
    if (df$Admin[i] == "Y") {
        ### An admin gets their own email
        df$Adminemail[i] <- as.character(df$Email[i])
    } else {
        ### Otherwise fill in the first admin email found for this customer
        sub <- df[df$CustNum == df$CustNum[i], ]
        df$Adminemail[i] <- as.character(sub[sub$Admin == "Y", ]$Email[1])
    }
}

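If your dataset is large, the for loop may cause you some problems. However, if you can create a unique ID, I'm fairly sure data.table offers a better, more optimized solution.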
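
For larger data, one loop-free variant in base R can be sketched with `match()`, which returns the position of the first match and therefore picks the first admin per customer. This is an alternative to the loop above, not a data.table solution; `stringsAsFactors = FALSE` and the helper name `first_admin` are assumptions made for the example.

```r
# Sample data from the question; stringsAsFactors = FALSE is an added
# assumption so that no as.character() conversions are needed.
df <- data.frame(
  Email   = paste0("test", 1:7, "@gmail.com"),
  Admin   = c("Y", "N", "N", "Y", "N", "Y", "N"),
  CustNum = c("1111", "1111", "1111", "2222", "2222", "2222", "2222"),
  stringsAsFactors = FALSE
)

admins <- df[df$Admin == "Y", ]

# match() gives, for each row of df, the index of the FIRST admin row with
# the same CustNum (NA if the customer has no admin at all).
first_admin <- admins$Email[match(df$CustNum, admins$CustNum)]

# Admins keep their own email; everyone else gets the first admin's email.
df$Adminemail <- ifelse(df$Admin == "Y", df$Email, first_admin)
df
```

Because nothing is joined, the number of rows cannot change, which avoids the duplicate records that `merge()` produced in the question.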

Non-working solution, but possibly a better route

  df$Unique <- paste(df$Email,df$CustNum,sep="_")


  library(data.table)
  setDT(df) 
  setDT(admins)

  # inner join - use `nomatch` argument
  admins[df, nomatch=0L, on = "Unique"]

I found this code in this post.
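
For comparison, a grouped update is one way data.table could handle this without a join or a unique ID. The following is only a sketch, not the code from the linked post: the table name `dt` is made up for the example, and it assumes the `data.table` package is installed.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
# (data.table() keeps strings as character by default).
dt <- data.table(
  Email   = paste0("test", 1:7, "@gmail.com"),
  Admin   = c("Y", "N", "N", "Y", "N", "Y", "N"),
  CustNum = c("1111", "1111", "1111", "2222", "2222", "2222", "2222")
)

# Within each customer: admins keep their own email, everyone else gets
# the first admin email of the group (NA if the group has no admin).
dt[, Adminemail := ifelse(Admin == "Y", Email, Email[Admin == "Y"][1]),
   by = CustNum]
dt
```

`:=` adds the column by reference and `by = CustNum` evaluates the expression once per customer, so the row count stays at 7.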

Answer 1 (score: 1)

I couldn't find a solution that avoids a loop, but this works; give it a try.

## sample Data
df <- data.frame(Email = c("test1@gmail.com", "test2@gmail.com", "test3@gmail.com","test4@gmail.com","test5@gmail.com","test6@gmail.com", "test7@gmail.com"),
             Admin = c("Y", "N", "N","Y","N", "Y", "N"),
             CustNum = c("1111","1111","1111","2222","2222","2222", "2222"))

## My solution
library(dplyr)
admins <- df %>% filter(Admin == 'Y') %>% 
    select(Email, Admin, CustNum) %>% 
    mutate(AdminEmail = Email)
# find the first match for each unique CustNum
ind = sapply(unique(admins$CustNum), function(x) which(admins$CustNum == x)[1])
first_match = admins[ind, ]
# merge data
output = full_join(df, admins, by = c('Email', 'CustNum', 'Admin'))
# fill in NAs
for (i in seq_len(nrow(output))) {
    if (is.na(output$AdminEmail[i])) {
        output$AdminEmail[i] = first_match$AdminEmail[which(first_match$CustNum == output$CustNum[i])]
    }
}
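
The fill-in loop can probably be replaced by a grouped `mutate()`. The sketch below is one possible loop-free variant, not the answer's original method; it assumes the `dplyr` package is installed and that `Email` is a character column (hence the added `stringsAsFactors = FALSE`).

```r
library(dplyr)

# Sample data from the question; stringsAsFactors = FALSE is an added
# assumption because if_else() needs both branches to have the same type.
df <- data.frame(
  Email   = paste0("test", 1:7, "@gmail.com"),
  Admin   = c("Y", "N", "N", "Y", "N", "Y", "N"),
  CustNum = c("1111", "1111", "1111", "2222", "2222", "2222", "2222"),
  stringsAsFactors = FALSE
)

output <- df %>%
  group_by(CustNum) %>%
  # Admins keep their own email; other rows take the first admin email
  # within the same customer number.
  mutate(AdminEmail = if_else(Admin == "Y", Email, first(Email[Admin == "Y"]))) %>%
  ungroup()
output
```

Since no join is involved, each of the 7 input rows appears exactly once in `output`.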