R: duplicated rows with wildcards

Time: 2016-06-26 13:06:24

Tags: r duplicates

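The problem is easy to state, but implementing the required functionality seems too challenging for me. I want a function that gives me all rows of a data.frame that are identical except in n columns. In other words: a function that returns near-duplicate rows (rows in which only n entries are allowed to differ).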

Here I found some data that looks very similar to mine. I used the first two rows of that data to generate my example data:

gw <- structure(list(
  TIME = structure(c(2L, 1L, 2L, 2L, 1L),
                   .Label = c("05.12.2000", "26.07.2000"), class = "factor"),
  GAUGE_ID = c(198L, 200L, 198L, 198L, 200L),
  PH = c(7.22, 7.2, 7.22, 7.22, 7.2),
  EH = c(100L, 470L, 100L, 100L, 470L),
  CON = c(595L, 672L, 595L, 595L, 672L),
  TEMP = c(9.1, 10, 9.1, 9.1, 10),
  O2MG = c(0, 3.8, 0, 0.005, 3.8),
  NH4 = c(0.24, 0.06, 0.24, 0.24, 0.06),
  NH4N = c(0.19, 0.05, 0.19, 0.19, 0.05),
  PO4 = c(0.061, 0.031, 0.061, 0.061, 0.031),
  OPO4P = c(0.02, 0.01, 0.02, 0.02, 0.01),
  SAK = c(9.8, 11.3, 9.8, 9.8, 11.3),
  CL = c(22.76, 18.49, 22.76, 22.76, 18.49),
  BR = c(0, 0.06, 0, 0.015, 0.06),
  NO2 = c(0, 0.06, 0, 0.005, 0.06),
  NO3 = c(0.02, 46.61, 0.02, 0.015, 46.61),
  SO4 = c(39.91, 60.17, 39.91, 39.91, 60.17),
  NA. = c(8.19, 8.34, 8.19, 8.19, 8.34),
  K = c(3.23, 1.03, 3.23, 3.23, 1.03),
  MG = c(4.21, 7.82, 4.21, 4.21, 7.82),
  CA = c(110.72, 115.77, 110.72, 110.72, 115.77),
  DOC = c(4.67, 7.9, 4.67, 4.67, 7.9),
  FE2 = c(1.62, 0.12, 1.62, 1.62, 0.12),
  MN = c(NA, NA, NA, NA, NA),
  HCO3 = c(5.11, 5.05, 5.11, 5.11, 5.05)),
  .Names = c("TIME", "GAUGE_ID", "PH", "EH", "CON", "TEMP", "O2MG", "NH4",
             "NH4N", "PO4", "OPO4P", "SAK", "CL", "BR", "NO2", "NO3", "SO4",
             "NA.", "K", "MG", "CA", "DOC", "FE2", "MN", "HCO3"),
  row.names = c(NA, 5L), class = "data.frame")

This is my attempt at a function that does what I want:

ulti.dup <- function(x, widlcards = NULL, ...){

  if(is.null(wildcards)){
    print(which(duplicated(x, ...)))
  } else if(!is.numeric(wildcards)){
    stop("wildcards has to be the maximum number of not matching columns and though numeric")
  } else{
    comb <- combn(1:ncol(x), m = wildcards, simplify = FALSE)
    dups <- c()
    for(col in comb){
      dups <- c(dups, which(duplicated(x[, -col], ...)))
    }
    print(dups[-which(duplicated(dups))])
  }
}

However, ulti.dup only finds the duplicated rows 3 and 5, and not row 4, which it should also find for wildcards >= 4.
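When an argument seems to have no effect like this, one thing worth ruling out is R's lexical scoping. The sketch below uses a hypothetical function f (not part of the code above) whose formal argument is misspelled: the value passed in the call is absorbed by `...`, and the body silently picks up an object of the intended name from the workspace.

```r
# Hypothetical function, only to illustrate the scoping effect
f <- function(x, widlcards = NULL, ...) {
  # `wildcards` is not an argument of f, so R searches the enclosing
  # environments and finds the object defined in the workspace
  wildcards
}

wildcards <- 1                # leftover object in the workspace
res <- f(10, wildcards = 4)   # the value 4 is absorbed by `...`
res                           # 1, not 4
```

In a clean session without a stray `wildcards` object, the same call stops with an "object 'wildcards' not found" error, which makes the misspelling visible immediately.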

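For those interested in more background: I have two data.frames that share some samples, but in one of them the values below the detection limit have been replaced by half the detection limit (as is the case for rows 4 and 5 in my example). I need to merge those data.frames and remove all duplicated samples (rows).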

1 Answer:

Answer 0 (score: 0)

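Well, it seems that my function, as given in the question, only had a small typo, which I did not notice because there was another object named wildcards in my workspace. Here is the very slow but working code: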

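ulti.dup <- function(x, wildcards = NULL, ...){

  if(is.null(wildcards)){
    # no wildcards: plain search for exact duplicates
    print(which(duplicated(x, ...)))
  } else if(!is.numeric(wildcards)){
    stop("wildcards has to be the maximum number of not matching columns and therefore numeric")
  } else{
    # every possible set of `wildcards` columns to ignore
    comb <- combn(seq_len(ncol(x)), m = wildcards, simplify = FALSE)
    dups <- c()
    for(col in comb){
      # drop = FALSE keeps a data.frame even if only one column remains
      dups <- c(dups, which(duplicated(x[, -col, drop = FALSE], ...)))
    }
    # unique() instead of dups[-which(duplicated(dups))], which returns
    # an empty vector when dups contains no repeated values
    print(sort(unique(dups)))
  }
}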
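Since the loop over all column combinations grows quickly with the number of columns, a possible alternative is to compare rows directly and count in how many columns they differ. The following is only a sketch under assumptions, not the method from the answer: near.dup and toy are hypothetical names, all columns are compared as character strings, and two NA values are treated as equal.

```r
# Hypothetical helper: flag every row that differs from at least one
# earlier row in at most `wildcards` columns.
near.dup <- function(x, wildcards = 0) {
  stopifnot(is.data.frame(x), is.numeric(wildcards), length(wildcards) == 1)
  n <- nrow(x)
  # character matrix: one row per sample, one column per variable
  m <- matrix(vapply(x, as.character, character(n)), nrow = n)
  m[is.na(m)] <- "NA"   # assumption: two NAs count as a match
  hit <- logical(n)
  for (i in seq_len(n)[-1]) {
    earlier <- m[seq_len(i - 1), , drop = FALSE]
    # number of columns in which row i differs from each earlier row
    n.diff <- rowSums(earlier != rep(m[i, ], each = i - 1))
    hit[i] <- any(n.diff <= wildcards)
  }
  which(hit)
}

# Toy data modelled on a few columns of gw (row 4 differs in two columns)
toy <- data.frame(
  id = c(198L, 200L, 198L, 198L, 200L),
  ph = c(7.22, 7.2, 7.22, 7.22, 7.2),
  o2 = c(0, 3.8, 0, 0.005, 3.8),
  br = c(0, 0.06, 0, 0.015, 0.06)
)
near.dup(toy)                 # rows 3 and 5 (exact duplicates)
near.dup(toy, wildcards = 2)  # rows 3, 4 and 5
```

This needs one comparison per pair of rows rather than one duplicated() call per column combination, so the cost no longer depends on the value of wildcards. Unlike ulti.dup, it returns the row indices instead of printing them.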