基于多列中直接和间接相似性对变量进行分组的快速方法

时间:2019-06-24 16:57:32

标签: r loops optimization data.table grepl

我有一个相对较大的数据集(1,750,000行,5列),其中包含具有唯一ID值的记录(第一列),由四个条件(其他4列)描述。一个小例子是:

# example
library(data.table)
dt <- data.table(id=c("a1","b3","c7","d5","e3","f4","g2","h1","i9","j6"), 
                 s1=c("a","b","c","l","l","v","v","v",NA,NA), 
                 s2=c("d","d","e","k","k","o","o","o",NA,NA),
                 s3=c("f","g","f","n","n","s","r","u","w","z"),
                 s4=c("h","i","j","m","m","t","t","t",NA,NA))

如下所示:

   id   s1   s2 s3   s4
 1: a1    a    d  f    h
 2: b3    b    d  g    i
 3: c7    c    e  f    j
 4: d5    l    k  n    m
 5: e3    l    k  n    m
 6: f4    v    o  s    t
 7: g2    v    o  r    t
 8: h1    v    o  u    t
 9: i9 <NA> <NA>  w <NA>
10: j6 <NA> <NA>  z <NA>

我的最终目标是在任何说明列中查找所有具有相同字符的记录(不考虑NA),并将它们分组为新的ID,以便我可以轻松地识别重复的记录。这些ID是通过串联每行的ID来构造的。

事情变得更加混乱,因为我可以直接直接找到重复描述的记录。因此,我目前分两个步骤进行此操作。

步骤1-根据直接重复项构造重复的ID

# grouping ids with duplicated info in any of the columns
#sorry, I could not find search for duplicates using multiple columns simultaneously...
dt[!is.na(dt$s1),ids1:= paste(id,collapse="|"), by = list(s1)]
dt[!is.na(dt$s1),ids2:= paste(id,collapse="|"), by = list(s2)]
dt[!is.na(dt$s1),ids3:= paste(id,collapse="|"), by = list(s3)]
dt[!is.na(dt$s1),ids4:= paste(id,collapse="|"), by = list(s4)]

# getting a unique duplicated ID for each row
dt$new.id <- apply(dt[,.(ids1,ids2,ids3,ids4)], 1, paste, collapse="|")
dt$new.id <- apply(dt[,"new.id",drop=FALSE], 1, function(x) paste(unique(strsplit(x,"\\|")[[1]]),collapse="|"))

此操作将产生以下结果,唯一的重复ID定义为“ new.id”:

   id   s1   s2 s3   s4     ids1     ids2  ids3     ids4   new.id
 1: a1    a    d  f    h       a1    a1|b3 a1|c7       a1 a1|b3|c7
 2: b3    b    d  g    i       b3    a1|b3    b3       b3    b3|a1
 3: c7    c    e  f    j       c7       c7 a1|c7       c7    c7|a1
 4: d5    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 5: e3    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 6: f4    v    o  s    t f4|g2|h1 f4|g2|h1    f4 f4|g2|h1 f4|g2|h1
 7: g2    v    o  r    t f4|g2|h1 f4|g2|h1    g2 f4|g2|h1 f4|g2|h1
 8: h1    v    o  u    t f4|g2|h1 f4|g2|h1    h1 f4|g2|h1 f4|g2|h1
 9: i9 <NA> <NA>  w <NA>     <NA>     <NA>  <NA>     <NA>       NA
10: j6 <NA> <NA>  z <NA>     <NA>     <NA>  <NA>     <NA>       NA

请注意,记录“ b3”和“ c7”是通过“ a1”间接复制的(所有其他示例都是直接复制,应保持相同)。这就是为什么我们需要下一步。

步骤2-基于间接重复更新重复的ID

#filtering the relevant columns for the indirect search
dt = dt[,.(id,new.id)]

#creating the patterns to be used by grepl() for the look-up for each row
dt[,patt:= .(paste(paste("^",id,"\\||",sep=""),paste("\\|",id,"\\||",sep=""),paste("\\|",id,"$",sep=""),collapse = "" ,sep="")), by = list(id)]

#Transforming the ID vector into factor and setting it as a 'key' to the data.table (speed up the processing)
dt$new.id = as.factor(dt$new.id)
setkeyv(dt, c("new.id"))

#Performing the loop using sapply
library(stringr)
for(i in 1:nrow(dt)) {
  pat = dt$patt[i] # retrieving the research pattern
  tmp = dt[new.id %like% pat] # searching the pattern using grepl()
  if(dim(tmp)[1]>1) {
    x = which.max(str_count(tmp$new.id, "\\|"))
    dt$new.id[i] = as.character(tmp$new.id[x])
  }
}

#filtering the final columns 
dt = dt[,.(id,new.id)]

决赛桌如下:

   id   new.id
 1: a1 a1|b3|c7
 2: b3 a1|b3|c7
 3: c7 a1|b3|c7
 4: d5    d5|e3
 5: e3    d5|e3
 6: f4 f4|g2|h1
 7: g2 f4|g2|h1
 8: h1 f4|g2|h1
 9: i9       NA
10: j6       NA

请注意,现在,前三个记录(“ a1”,“ b3”,“ c7”)被分组为一个更宽泛的重复ID,其中包含直接记录和间接记录。

一切正常,但是我的代码太慢了。花了整整2天的时间来运行数据集的一半(〜800,0000)。我可以将循环并行化为不同的内核,但是仍然需要几个小时。而且我几乎可以肯定,我可以以更好的方式使用data.table功能,也许可以在循环中使用“ set”。今天,我花了数小时试图使用data.table实现相同的代码,但是我对它的语法并不陌生,在这里确实很困难。关于如何优化此代码的任何建议?

注意:代码最慢的部分是循环,而在循环内部,效率最低的步骤是data.table内部模式的grepl()。似乎为data.table设置了一个“键”可以加快该过程,但是在我的情况下,我没有改变执行grepl()所花费的时间。

2 个答案:

答案 0 :(得分:12)

您可能将此视为网络问题。在这里,我使用igraph包中的函数。基本步骤:

  1. melt数据转换为长格式。

  2. 使用graph_from_data_frame创建一个图形,其中'id'和'value'列被视为边列表。

  3. 使用components来获取图形的连接组件,即通过其条件直接或间接连接的“ id”。

  4. 选择membership元素以获取“每个顶点所属的群集ID”。

  5. 将成员资格加入原始数据。

  6. 按集群成员身份串联“ id”。


library(igraph)

# melt data to long format, remove NA values
d <- melt(dt, id.vars = "id", na.rm = TRUE)

# convert to graph
g <- graph_from_data_frame(d[ , .(id, value)])

# get components and their named membership id 
mem <- components(g)$membership

# add membership id to original data
dt[.(names(mem)), on = .(id), mem := mem] 

# for groups of length one, set 'mem' to NA
dt[dt[, .I[.N == 1], by = mem]$V1, mem := NA]

如果需要,可将'id'与'mem'列连接起来(对于非NA'mem')(恕我直言,这只会使进一步的数据处理更加困难;))。无论如何,我们开始:

dt[!is.na(mem), id2 := paste(id, collapse = "|"), by = mem]

#     id   s1   s2 s3   s4  mem      id2
#  1: a1    a    d  f    h    1 a1|b3|c7
#  2: b3    b    d  g    i    1 a1|b3|c7
#  3: c7    c    e  f    j    1 a1|b3|c7
#  4: d5    l    k  l    m    2    d5|e3
#  5: e3    l    k  l    m    2    d5|e3
#  6: f4    o    o  s    o    3 f4|g2|h1
#  7: g2    o    o  r    o    3 f4|g2|h1
#  8: h1    o    o  u    o    3 f4|g2|h1
#  9: i9 <NA> <NA>  w <NA>   NA     <NA>
# 10: j6 <NA> <NA>  z <NA>   NA     <NA>

在这个小例子中,图的基本图只是为了说明连接的组件:

plot(g, edge.arrow.size = 0.5, edge.arrow.width = 0.8, vertex.label.cex = 2, edge.curved = FALSE)

enter image description here

答案 1 :(得分:6)

我认为这种递归方法可以满足您的需求。 基本上,它在每一列上执行自联接, 一次一个, 如果匹配多于一行 (即正在考虑的行以外的其他行), 它会保存匹配中的所有唯一ID。 它通过利用secondary indices避免将行与NA一起使用。 诀窍是我们进行两次递归, 一次是id,一次是新创建的new_id

dt[, new_id := .(list(character()))]

get_ids <- function(matched_ids, new_id) {
  if (length(matched_ids) > 1L) {
    list(unique(
      c(new_id[[1L]], unlist(matched_ids))
    ))
  } else {
    new_id
  }
}

find_recursively <- function(dt, cols, pass) {
  if (length(cols) == 0L) return(invisible())

  current <- cols[1L]
  next_cols <- cols[-1L]

  next_dt <- switch(
    pass,

    first = dt[!list(NA_character_),
               new_id := dt[.SD, .(get_ids(x.id, i.new_id)), on = current, by = .EACHI]$V1,
               on = current],

    second = dt[!list(NA_character_),
                new_id := dt[.SD, .(get_ids(x.new_id, i.new_id)), on = current, by = .EACHI]$V1,
                on = current]
  )

  find_recursively(next_dt, next_cols, pass)
}

find_recursively(dt, paste0("s", 1:4), "first")
find_recursively(dt, paste0("s", 1:4), "second")

dt[, new_id := sapply(new_id, function(nid) {
  ids <- unlist(nid)
  if (length(ids) == 0L) {
    NA_character_
  } else {
    paste(ids, collapse = "|")
  }
})]

print(dt)
    id   s1   s2 s3   s4   new_id
 1: a1    a    d  f    h a1|b3|c7
 2: b3    b    d  g    i a1|b3|c7
 3: c7    c    e  f    j a1|c7|b3
 4: d5    l    k  l    m    d5|e3
 5: e3    l    k  l    m    d5|e3
 6: f4    o    o  s    o f4|g2|h1
 7: g2    o    o  r    o f4|g2|h1
 8: h1    o    o  u    o f4|g2|h1
 9: i9 <NA> <NA>  w <NA>     <NA>
10: j6 <NA> <NA>  z <NA>     <NA>

联接使用this idiom