How do I self-join a table with 100K+ rows per key?

Time: 2019-07-04 15:32:15

Tags: apache-spark fuzzy-search sparklyr cross-join

I am trying to find similar company names using Spark and R. My dataset currently has 500M+ rows and will only grow; the number of unique company names is 75M+. I am running this on AWS EMR with 15 core nodes and one master node. The Spark settings are as follows:

spark.executor.cores: 3
spark.executor.memory: 30G
spark.driver.memory: 80G
spark.driver.cores: 8
spark.memory.fraction: 0.1
spark.executor.memoryOverhead: 5G

Number of nodes: 15 (64 cores, 488 GB RAM, and a 200 GB SSD each). Executors per node: 13. Partition multiplier: 19.
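
For reference, these settings map onto a sparklyr connection roughly as below; `master = 'yarn'` and the connection details are assumptions, since the question only lists the property values:

    library(sparklyr)

    # sketch of the connection config, assuming YARN on EMR; the property
    # values mirror the settings listed above
    config <- spark_config()
    config$spark.executor.cores          <- 3
    config$spark.executor.memory         <- "30G"
    config$spark.driver.memory           <- "80G"
    config$spark.driver.cores            <- 8
    config$spark.memory.fraction        <- 0.1
    config$spark.executor.memoryOverhead <- "5G"

    sc <- spark_connect(master = "yarn", config = config)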

To do this, I first compute the SOUNDEX of each company name and self-join the table on the SOUNDEX plus the first two letters of the name. After the join, I compute the Levenshtein distance between the two company names in each row; if the distance is within a limit, I pick the most frequent spelling, otherwise I discard the row.
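
As a toy illustration of the blocking key and the distance check (a sketch only; `sc` is assumed to be an open sparklyr connection, and `soundex()` / `levenshtein()` are Spark SQL built-ins that dplyr pushes down unchanged):

    # hypothetical two-name example: both spellings produce the same
    # SOUNDEX code and the same first two letters, so they share a key
    toy <- sdf_copy_to(sc, data.frame(name = c('walgreens', 'walgreen')),
                       overwrite = TRUE)

    toy %>%
        mutate(key = paste(soundex(name), '|', substr(name, 1, 2))) %>%
        collect()

    # the self-join on `key` then pairs the two rows, and
    # levenshtein('walgreens', 'walgreen') = 1 is within the limit,
    # so the more frequent spelling would be kept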

The problem lies in the self-join (as far as I can tell, though I am happy to take other suggestions or alternative approaches). Because the output grows quadratically with key size, a single key shared by 100K names already produces 10^10 joined rows on its own.

I have tried increasing the partition multiplier to 100 and reducing spark.executor.cores to 2.

I have also tried joining on the SOUNDEX plus the first three letters instead of two.
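
A further option, not tried above, is salting the hot keys so that no single `grpcol` group dominates one task; a hedged sketch against the `fuzzy_spark_df` built in the code below, with the bucket count `n.salt` chosen arbitrarily:

    # each pair is still compared exactly once: the left side draws one
    # random salt per row, while the right side is replicated across all
    # n.salt buckets
    n.salt <- 20L
    salts  <- sdf_copy_to(sc, data.frame(salt = 0:(n.salt - 1)), overwrite = TRUE)

    left_side <-
        fuzzy_spark_df %>%
        mutate(salt = as.integer(floor(rand() * !!n.salt)))

    right_side <-
        fuzzy_spark_df %>%
        mutate(dummy = 1L) %>%
        inner_join(salts %>% mutate(dummy = 1L), by = 'dummy') %>% # replicate each row n.salt times
        select(-dummy)

    salted <-
        left_side %>% select(old_col, grpcol, salt) %>%
        inner_join(right_side, by = c('grpcol', 'salt'), suffix = c('_orig', '_fuzzy'))

A hot group of n rows then splits into n.salt sub-joins of roughly (n / n.salt) × n rows each, spreading the quadratic work across tasks instead of concentrating it in one.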

Code:

Spark + R



    library(sparklyr)
    library(dplyr)

    # fuzzy-matching parameters
    dist.frac = 0.2        # allowed distance as a fraction of name length
    min.dist.float = 0.7   # floor on the allowed distance
    max.dist.float = 4     # cap on the allowed distance
    method = 'dl'
    old.col = 'company_name'
    pref = 'fuzzy'
    substr.mn = 1          # start of the prefix used in the blocking key
    substr.mx = 2          # end of the prefix used in the blocking key
    weight = c(1, 1, 1, 1)

    # not defined in the original snippet; assumed values
    persist.type = 'MEMORY_AND_DISK'
    n.partitions = 11115   # assumed: executors (15 * 13) * cores (3) * multiplier (19)

    new.col = paste(pref, old.col, sep = '_')

    # standardise the name column; sym() is needed to resolve the column
    # name held in the string old.col
    spark_df =
        spark_df %>%
        rename(old_col = !!sym(old.col))

    # one row per unique name with its frequency, length, allowed distance,
    # and the blocking key: SOUNDEX plus the first two letters
    fuzzy_spark_df =
        spark_df %>%
        group_by(old_col) %>%
        summarize(freq = n()) %>%
        mutate(character_count = length(old_col)) %>%
        mutate(max_dist = pmin(pmax(min.dist.float, dist.frac * character_count), max.dist.float)) %>%
        mutate(grpcol = paste(soundex(old_col), '|', substr(old_col, substr.mn, substr.mx))) %>%
        filter(character_count >= 2)

    fuzzy_spark_df =
        fuzzy_spark_df %>%
        sdf_persist(storage.level = persist.type)

    fuzzy_spark_df =
        fuzzy_spark_df %>%
        sdf_repartition(partition_by = c('grpcol'), partitions = n.partitions)

    collect(head(fuzzy_spark_df)) # to force computation

    logNote('performing the cross join') # logNote() is the author's logging helper

    # self-join on the blocking key: every pair of names sharing a key is compared
    fuzzy_spark_df =
        fuzzy_spark_df %>% select(old_col, grpcol) %>%
        inner_join(fuzzy_spark_df, by = c('grpcol'), suffix = c('_orig', '_fuzzy'))

    fuzzy_spark_df =
        fuzzy_spark_df %>%
        mutate(distance = levenshtein(old_col_orig, old_col_fuzzy))

    # keep, for each original name, the most frequent then closest candidate
    # within its allowed distance
    fuzzy_spark_df =
        fuzzy_spark_df %>%
        filter(max_dist >= distance) %>%
        group_by(old_col_orig) %>%
        filter(freq == max(freq, na.rm = T) & distance == min(distance, na.rm = T)) %>%
        arrange(old_col_fuzzy) %>%
        filter(row_number() == 1) %>%
        select('old_col_orig', 'old_col_fuzzy') %>%
        distinct()

    # map every row of the original data to its canonical fuzzy name
    spark_df =
        spark_df %>%
        sdf_broadcast() %>%
        left_join(fuzzy_spark_df, by = c('old_col' = 'old_col_orig')) %>%
        mutate(old_col_fuzzy = ifelse(is.na(old_col_fuzzy) | is.null(old_col_fuzzy), 'blank', old_col_fuzzy)) %>%
        rename(!!old.col := 'old_col', !!new.col := 'old_col_fuzzy')
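
To sanity-check the distance logic locally, base R's `adist()` computes the same Levenshtein distances on the example names (a local check only, not part of the pipeline):

    # generalized edit distance; row/column order follows the input vector
    adist(c('walgreens', 'walgreen', 'walmart'))
    #      [,1] [,2] [,3]
    # [1,]    0    1    6
    # [2,]    1    0    5
    # [3,]    6    5    0

With dist.frac = 0.2, 'walgreens' (9 characters) gets max_dist = min(max(0.7, 1.8), 4) = 1.8, so only 'walgreen' at distance 1 survives the filter, while 'walmart' at distance 6 is discarded.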

Spark dataframe spark_df:

  company_name join_col
  walgreens     W123 | wa
  walgreen      W123 | wa
  walmart       W123 | wa
  cisco         C654 | ci
  cicso         C654 | ci
  carta         C986 | ca

Output (spark_df inner-joined with itself on column join_col):

   company_name    join_col    fuzzy_company_name
     walgreens     W123 | wa   walgreen
     walgreens     W123 | wa   walgreens
     walgreens     W123 | wa   walmart
     walgreen      W123 | wa   walgreen
     walgreen      W123 | wa   walgreens
     walgreen      W123 | wa   walmart
     walmart       W123 | wa   walgreen
     walmart       W123 | wa   walgreens
     walmart       W123 | wa   walmart
     cisco         C654 | ci   cisco
     cisco         C654 | ci   cicso
     cicso         C654 | ci   cicso
     cicso         C654 | ci   cisco
     carta         C986 | ca   carta

0 Answers