Question

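I have two data frames (x & y) with the IDs student_name, father_name and mother_name. Because of typographical errors ("n" instead of "m", random white space, etc.), about 60% of the values do not line up, even though I can eyeball the data and see that they should. Is there a way to reduce the number of non-matches so that manual editing at least becomes feasible? The data frames have about 700K observations.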

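R would be best. I know a bit of Python and some basic Unix tools. P.S. I have read about agrep(), but I don't understand how it would work on an actual data set, especially when matching on more than one variable.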

Update (data for the posted bounty):

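Here are two example data frames, sites_a and sites_b. They can be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) lat + lon, b) sitename, or c) both.

You can source the file test_sites.R, which is posted as a gist.

Ideally, the answer would end with: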

Answer 1

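The agrep function (part of base R), which does approximate string matching using the Levenshtein edit distance, is probably worth trying. Without knowing what your data look like, I can't really suggest a working solution. But here is a suggestion: it records the matches in a separate list (if there are several equally good matches, those are recorded as well). Assume your data.frame is called df: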

l <- vector('list',nrow(df))
matches <- list(mother = l,father = l)
for(i in 1:nrow(df)){
  father_id <- with(df,which(student_name[i] == father_name))
  if(length(father_id) == 1){
    matches[['father']][[i]] <- father_id
  } else {
    old_father_id <- NULL
    ## no unique exact match: try approximate matches, from loose to strict
    for(m in 10:1){ ## m is the maximum distance
      father_id <- with(df,agrep(student_name[i],father_name,max.distance = m))
      if(length(father_id) == 1 || m == 1){
        ## if we find a unique match or if we are in our last round, then stop
        matches[['father']][[i]] <- father_id
        break
      } else if(length(father_id) == 0 && length(old_father_id) > 0) {
        ## if we can't do better than multiple matches, then record them anyway
        matches[['father']][[i]] <- old_father_id
        break
      } else if(length(father_id) == 0 && length(old_father_id) == 0) {
        ## if the nearest match is more than 10 different from the current pattern, then stop
        break
      }
      ## remember this round's matches in case the next, stricter round finds none
      old_father_id <- father_id
    }
  }
}

The code for mother_name is essentially the same. You could even put both in a single loop, but this example is only meant as an illustration.
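Since the question mentions random white space as one source of mismatches, it may help to normalise the names before calling agrep, so that the edit distance is spent on real typos only. Below is a minimal sketch of that idea; the helper clean_name and the toy names are made up for illustration and are not part of the original data.

```r
# Hypothetical helper: lower-case, collapse runs of white space, trim the ends
clean_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Toy data, invented for this example
father_name <- c("Tom  Smith", " tim smith", "John Doe")
cleaned <- clean_name(father_name)

# Indices of the entries that are within two edits of the pattern
hits <- agrep("tom smith", cleaned, max.distance = 2)
cleaned[hits]
```

Here the first two entries are returned and "john doe" is not. On the real data the right max.distance has to be found by trial, as in the loop above.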

Answer 2

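This takes a list of common column names, matches on the agrep of all those columns pasted together, and then, if all.x or all.y equals TRUE, appends the non-matching records with the missing columns filled with NA. Unlike merge, the column names to match on need to be the same in each data frame. The challenge seems to be setting the agrep options correctly to avoid spurious matches.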

agrepMerge <- function(df1, df2, by, all.x = FALSE, all.y = FALSE, 
    ignore.case = FALSE, value = FALSE, max.distance = 0.1, useBytes = FALSE) {

    df1$index <- apply(df1[,by, drop = FALSE], 1, paste, sep = "", collapse = "")
    df2$index <- apply(df2[,by, drop = FALSE], 1, paste, sep = "", collapse = "")

    matches <- lapply(seq_along(df1$index), function(i, ...) {
      agrep(df1$index[i], df2$index, ignore.case = ignore.case, value = value,
            max.distance = max.distance, useBytes = useBytes)
    })

    df1_match <- rep(1:nrow(df1), sapply(matches, length))
    df2_match <- unlist(matches)

    df1_hits <- df1[df1_match,]
    df2_hits <- df2[df2_match,]

    df1_miss <- df1[setdiff(seq_along(df1$index), df1_match),]
    df2_miss <- df2[setdiff(seq_along(df2$index), df2_match),]

    remove_cols <- colnames(df2_hits) %in% colnames(df1_hits)

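    # drop = FALSE keeps a data frame (and its column name) even if only one column is left
    df_out <- cbind(df1_hits, df2_hits[,!remove_cols, drop = FALSE])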

    if(all.x) {
      missing_cols <- setdiff(colnames(df_out), colnames(df1_miss))
      df1_miss[missing_cols] <- NA
      df_out <- rbind(df_out, df1_miss)
    }
    if(all.y) {
      missing_cols <- setdiff(colnames(df_out), colnames(df2_miss))
      df2_miss[missing_cols] <- NA
      df_out <- rbind(df_out, df2_miss)
    }
    df_out[,setdiff(colnames(df_out), "index")]
}
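To get a feeling for which max.distance avoids spurious matches, one option is to look at the edit distances directly with adist (also in base R) before merging. The following is only a sketch on invented site names, not the sites_a and sites_b data from the question.

```r
# Toy site names, invented for this example
a <- c("north creek", "south hill")
b <- c("north creak", "south hills", "east ridge")

# Matrix of Levenshtein distances: rows follow a, columns follow b
d <- adist(a, b)

# For each name in a, the closest candidate in b and its distance
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(a), best)]

data.frame(a = a, closest_b = b[best], distance = best_dist)
```

Sorting such a table by distance shows where the true matches stop and the false ones begin, which gives a starting value for max.distance. Note that adist compares whole strings by default, whereas agrep also accepts a close match on a substring, so the two will not always agree. With 700K rows the full distance matrix is too large to build in one go, so this would have to be done on a sample or in blocks.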

Merging through fuzzy matching of variables in R
