Optimizing a solution for finding common third persons on a large dataset

Asked: 2017-12-06 22:19:34

Tags: r optimization bigdata

This is a follow-up to my previous question. I am struggling to find a memory-efficient solution for identifying the common third person in my large dataset (3.5 million groups and 6.2 million persons).

The suggested solution using the igraph package works quickly for normal-sized datasets, but unfortunately runs into memory problems by creating a large matrix for the bigger dataset. A similar issue arises with my own solution using chained inner joins, where the third inner join inflates the data frame so much that my computer runs out of memory (16 GB).

    df.output <- inner_join(df, df, by = 'group') %>% 
      inner_join(., df, by = c('person.y' = 'person')) %>%
      inner_join(., df, by = c('group.y' = 'group')) %>% 
      rename(person_in_common = 'person.y', pers1 = 'person.x', pers2 = 'person') %>% 
      select(pers1, pers2, person_in_common) %>% 
      filter(pers1 != pers2) %>% 
      distinct() %>% 
      filter(person_in_common != pers1 & person_in_common != pers2)

    # sort pers1/pers2 within each row so that duplicated pairs collapse
    df.output[-3] <- t(apply(df.output[-3], 1, 
                             FUN = function(x) sort(x, decreasing = FALSE)))

    df.output <- unique(df.output)
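The blow-up of the first inner join can be estimated before running it: a self-join on `group` produces a number of rows equal to the sum of the squared group sizes, so a few large groups dominate the memory cost. A small base-R sketch with toy numbers (the real 3.7M-group data follows the same arithmetic):

```r
# Estimate the size of inner_join(df, df, by = 'group') without running it:
# each group of size n contributes n^2 rows to the self-join.
set.seed(1)
g <- sample(1:10, 100, replace = TRUE)  # toy stand-in for the real group column
sizes <- tabulate(factor(g))            # group sizes
sum(sizes^2)                            # rows produced by one self-join on 'group'
```

For the simulated data below (14M rows over 3.7M groups), the same formula gives the row count after the first join alone, before the second and third joins multiply it further.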

A small dataset example and the expected output:

df <- data.frame(group= c("a","a","b","b","b","c"),
             person = c("Tom","Jerry","Tom","Anna","Sam","Nic"), stringsAsFactors = FALSE)

df
    group person
1     a    Tom
2     a  Jerry
3     b    Tom
4     b   Anna
5     b    Sam
6     c    Nic

And the expected result:

df.output
  pers1 pers2 person_in_common
1  Anna Jerry              Tom
2 Jerry   Sam              Tom
3   Sam   Tom             Anna
4  Anna   Tom              Sam
6  Anna   Sam              Tom

Unfortunately, I have no access to a computer with more RAM and little experience with cloud computing, so I hope to make it work on my local machine. I would appreciate input on how to optimize any of the solutions, or advice on how to tackle the problem differently.

Edit 1

A data frame reflecting my actual data size:

set.seed(33)

Data <- data.frame(group = sample(1:3700000, 14000000, replace=TRUE),
                   person = sample(1:6800000, 14000000,replace = TRUE))

Edit 2

My real data is more complex than the sample data, with larger groups and more persons per group, so it will be more demanding. I could not figure out how to simulate this structure, so here is the real data for download:

Full person-group data

1 answer:

Answer 0 (score: 1)

So, I managed to run this on your test data (I have 16 GB of RAM), but if you run it on your small example you will notice that it does not give the same result. I could not figure out why, but maybe you can work it out. I will try to explain each step:

myFun <- function(dt) {
  require(data.table)
  # convert the data to a data.table:
  setDT(dt)
  # set key/order the data by group and person:
  setkey(dt, group, person)
  # copy the initial data and rename the soon-to-be-merged column to "p2",
  # which represents person2
  dta <- copy(dt)
  setnames(dta, "person", "p2")
  # the first merge using data.table:
  dt1 <- dt[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
  # now we remove rows where persons are the same:
  dt1 <- dt1[person != p2] # remove equal persons
  # and we also need to remove rows where person1 and person2 are the same,
  # just in a different order, for example:
  # 2:     a    Tom    Jerry
  # 3:     a  Jerry      Tom
  # If I get it right, you did this with apply at the end of your code,
  # but it is much better to reduce the data now.
  # This approach is also much faster (we write the pairwise min to the
  # second column and the max to the third):
  l1 <- pmin(dt1[[2]], dt1[[3]])
  l2 <- pmax(dt1[[2]], dt1[[3]])
  set(dt1, j = 2L, value = l1)
  set(dt1, j = 3L, value = l2)
  # now let's clear memory and take the unique rows of dt1:
  rm(l1, l2, dt)
  dt1 <- unique(dt1)
  gc()
  # change name for group column:
  setnames(dta, "group", "g2")
  # second merge:
  dt2 <- dt1[dta, on = "p2", allow.cartesian = TRUE, nomatch = 0]
  rm(dt1)
  gc()
  setnames(dta, "p2", "p3")
  dt3 <- dt2[dta, on = "g2", allow.cartesian = TRUE, nomatch = 0] # third merge
  rm(dt2)
  gc()
  dt3 <- dt3[p3 != p2 & p3 != person] # removing equal persons
  gc()
  dt3 <- dt3[, .(person, p2, p3)]
  gc()
  return(dt3[])
}

On the small dataset example:

df <- data.frame(group = c("a","a","b","b","b","c"),
                 person = c("Tom","Jerry","Tom","Anna","Sam","Nic"),
                 stringsAsFactors = FALSE)
df
myFun(df)
#    person  p2    p3
# 1:   Anna Tom Jerry
# 2:    Sam Tom Jerry
# 3:  Jerry Tom  Anna
# 4:    Sam Tom  Anna
# 5:  Jerry Tom   Sam
# 6:   Anna Tom   Sam
# 7:   Anna Sam   Tom

This is similar to your result, but not exactly the same.
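One way to compare the two outputs, ignoring which column marks the person in common, is to sort each row alphabetically so every triple prints in a canonical order, then take the unique rows. A base-R sketch (the `canon` helper is illustrative, not part of the original code):

```r
# Canonicalize each row of a three-column character result by sorting its
# entries, so identical person triples collapse regardless of column roles.
canon <- function(d) {
  m <- t(apply(as.matrix(d), 1, sort))
  unique(as.data.frame(m, stringsAsFactors = FALSE))
}

expected <- data.frame(pers1 = c("Anna", "Jerry", "Sam", "Anna", "Anna"),
                       pers2 = c("Jerry", "Sam", "Tom", "Tom", "Sam"),
                       person_in_common = c("Tom", "Tom", "Anna", "Sam", "Tom"),
                       stringsAsFactors = FALSE)
canon(expected)   # three distinct person triples
```

The same canonicalization can be applied to `myFun`'s seven rows for a side-by-side comparison with the expected output.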

Now with the larger data:

set.seed(33)
N <- 10e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
                 person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # 13.22 sec

rm(results)
gc()

set.seed(33)
N <- 14e6
dt <- data.frame(group = sample(3.7e6, N, replace = TRUE),
                 person = sample(6.8e6, N, replace = TRUE))
system.time(results <- myFun(dt)) # around 40 sec, but RAM usage does hit the maximum

Update

Maybe you could try this splitting approach, say with nparts of 6-10:

myFunNew3 <- function(dt, nparts = 2) {
  require(data.table)
  setDT(dt)
  setkey(dt, group, person)
  dta <- copy(dt)

  # split into N parts
  splits <- rep(1:nparts, each = ceiling(dt[, .N]/nparts))
  set(dt, j = "splits", value = splits)
  dtl <- split(dt, by = "splits", keep.by = FALSE)
  set(dt, j = "splits", value = NULL)
  rm(splits)
  gc()

  for (i in seq_along(dtl)) {
    X <- copy(dtl[[i]])
    setnames(dta, c("group", "person"))
    X <- X[dta, on = "group", allow.cartesian = TRUE, nomatch = 0]
    X <- X[person != i.person]
    gc()
    X <- X[dta, on = "person", allow.cartesian = TRUE, nomatch = 0]
    gc()
    setnames(dta, "group", "i.group")
    X <- X[dta, on = "i.group", allow.cartesian = TRUE, nomatch = 0] 
    gc()
    setnames(X, "i.person.1", "pers2")
    setnames(X, "i.person", "pers1" )
    setnames(X, "person", "person_in_common" )
    X <- X[, .(pers1, pers2, person_in_common)]
    gc()
    X <- X[pers1 != pers2 & person_in_common != pers1 & person_in_common != pers2]
    gc()
    name1 <- "pers1"
    name2 <- "pers2"
    l1 <- pmin(X[[name1]], X[[name2]])
    l2 <- pmax(X[[name1]], X[[name2]])
    set(X, j = name1, value = l1)
    set(X, j = name2, value = l2)
    rm(l1, l2)
    gc()
    X <- unique(X)
    gc()
    if (i > 1) {
      X1 <- rbindlist(list(X1, X), use.names = TRUE, fill = TRUE)
      X1 <- unique(X1)
      rm(X)
      gc()
    } else {
      X1 <- copy(X)
    }
    dtl[[i]] <- 0L
    gc()
  }
  rm(dta, dtl)
  gc()
  setkey(X1, pers1, pers2, person_in_common)
  X1[]
}