Finding the variables that form a primary key using SparkR

Time: 2018-11-15 19:16:45

Tags: r sparkr sparklyr

Here is my toy data:

df <- tibble::tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
    "A",   "C",    1L,    5L,  "AA",  "AB",    1L,
    "A",   "C",    2L,    5L,  "BB",  "AC",    2L,
    "A",   "D",    1L,    7L,  "AA",  "BC",    2L,
    "A",   "D",    2L,    3L,  "BB",  "CC",    1L,
    "B",   "C",    1L,    8L,  "AA",  "AB",    1L,
    "B",   "C",    2L,    6L,  "BB",  "AC",    2L,
    "B",   "D",    1L,    9L,  "AA",  "BC",    2L,
    "B",   "D",    2L,    6L,  "BB",  "CC",    1L)

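The original question, at https://stackoverflow.com/a/53110342/6762788 , was:

How can I find the smallest combinations of variables that uniquely identify the observations in a data frame, i.e. which variables can together form a primary key? Many thanks to thelatemail; the following answer/code works well.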

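# Every combination of column names, from size 1 up to all columns
nms <- unlist(lapply(seq_along(df), combn, x = names(df), simplify = FALSE), recursive = FALSE)
# For each combination, count the distinct rows it produces
out <- data.frame(
  vars = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = numeric(1))
)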
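To show how the result is read, here is a self-contained base R sketch that rebuilds the toy data as a plain data.frame, repeats the computation above, and keeps only the smallest key combinations. The `n_vars` column and the `is_key` and `keys` names are introduced here for illustration and are not part of the original answer.

```r
# Toy data from the question, rebuilt as a base data.frame
df <- data.frame(
  var1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  var2 = c("C", "C", "D", "D", "C", "C", "D", "D"),
  var3 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  var4 = c(5L, 5L, 7L, 3L, 8L, 6L, 9L, 6L),
  var5 = c("AA", "BB", "AA", "BB", "AA", "BB", "AA", "BB"),
  var6 = c("AB", "AC", "BC", "CC", "AB", "AC", "BC", "CC"),
  var7 = c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Every combination of column names, from size 1 up to all columns
nms <- unlist(lapply(seq_along(df), combn, x = names(df), simplify = FALSE),
              recursive = FALSE)

out <- data.frame(
  vars = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = numeric(1)),
  n_vars = lengths(nms),  # number of variables in each combination
  stringsAsFactors = FALSE
)

# A combination is a key when its distinct-row count equals the row count
is_key <- out$counts == nrow(df)
# Among the keys, keep those that use the fewest variables
keys <- out[is_key & out$n_vars == min(out$n_vars[is_key]), ]
keys
```

For the toy data, no single variable is unique, and `keys` should contain the three two-variable combinations var1,var6; var4,var6; and var4,var7.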

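Now, to make this work on big data, I want to port it to SparkR. Building on this answer, can anyone help me translate this code to SparkR? If that is difficult in SparkR, a sparklyr version would do.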

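1 answer:

Answer 0: (score: 0)

I broke the problem above into small pieces and tried the following SparkR code. However, the "counts <- lapply(nms, ..." line seems very slow. Given this code, can you suggest ways to further improve performance by improving the "counts <- lapply(nms, ..." line?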

library(SparkR); library(tidyverse)

# Start (or attach to) a Spark session; as.DataFrame() needs one
sparkR.session()
df_spark <- mtcars %>% as.DataFrame()

num_m <- seq_len(ncol(df_spark))

nam_list <- SparkR::colnames(df_spark)

combinations <- function(num_m) {
  combn(num_m, x=nam_list, simplify=FALSE)
}

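# All combinations of column names, flattened into one list
nms <- spark.lapply(num_m, combinations) %>% unlist(recursive = FALSE)

vars <- map_chr(nms, ~paste(.x, collapse = ","))

# One distinct-row count per combination, each run against df_spark
counts <- lapply(nms, function(x) df_spark %>% SparkR::select(x) %>% SparkR::distinct() %>% SparkR::count()) %>% unlist()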

out <- data.frame(
  vars = vars,
  counts = counts
)

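# Keep combinations whose distinct-row count equals the row count of df_spark,
# then keep only those with the fewest variables
primarykeys <- out %>% 
  dplyr::mutate(n_vars = str_count(vars, ",")+1) %>% 
  dplyr::filter(counts==SparkR::nrow(df_spark)) %>% 
  dplyr::filter(n_vars==min(n_vars))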

primarykeys
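One possible way to reduce the work in the counts step is to avoid evaluating every combination: walk through the combination sizes in increasing order and stop at the first size that produces at least one key, since only the smallest keys are wanted. The sketch below shows this idea in plain R so it can be run without a cluster. The names `find_min_keys`, `count_distinct`, and `toy` are hypothetical and introduced only for this illustration; this is not the original answer's method.

```r
# Level-wise search: try all combinations of size 1, then size 2, and so on,
# and return as soon as one size yields at least one key.
# count_distinct is any function that takes a character vector of column
# names and returns the number of distinct rows for those columns.
find_min_keys <- function(col_names, n_rows, count_distinct) {
  for (m in seq_along(col_names)) {
    combos <- combn(col_names, m, simplify = FALSE)
    counts <- vapply(combos, count_distinct, FUN.VALUE = numeric(1))
    hits <- combos[counts == n_rows]
    if (length(hits) > 0) return(hits)  # smallest keys found, stop here
  }
  list()  # no combination identifies the rows uniquely
}

# Small local data set used only to demonstrate the function
toy <- data.frame(
  a = c(1, 1, 2, 2),
  b = c("x", "y", "x", "y"),
  c = c(TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

keys <- find_min_keys(
  col_names = names(toy),
  n_rows = nrow(toy),
  count_distinct = function(cols) nrow(unique(toy[cols]))
)
keys
```

Under the assumption that the same approach carries over to SparkR, the `count_distinct` argument could be a function such as function(cols) SparkR::count(SparkR::distinct(SparkR::select(df_spark, as.list(cols)))), with n_rows taken from SparkR::nrow(df_spark). Because every count rescans the data, calling SparkR::cache(df_spark) beforehand may also help; whether it does depends on the data size and the cluster.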