I am trying to find similar company names using Spark and R. My dataset currently has 500M+ rows, and it will only grow; the number of unique company names is 75M+. I am running this on AWS EMR with 15 nodes plus one master node. The Spark settings are:
spark.executor.cores: 3
spark.executor.memory: 30G
spark.driver.memory: 80G
spark.driver.cores: 8
spark.memory.fraction: 0.1
spark.executor.memoryOverhead: 5G
Number of nodes: 15 (64 cores and 488 GB RAM with a 200 GB SSD each)
Executors per node: 13
Partition multiplier: 19
To do this, I first compute the SOUNDEX of each company name, then self-join on the SOUNDEX plus the first two letters of each name. After the join I compute the Levenshtein distance between the two company names in each row; if the distance is within a limit I pick the more popular name, otherwise I drop the row.
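For reference, here is roughly what the blocking key computes, sketched in plain Python (assuming Spark SQL's soundex behaves like standard American Soundex; `blocking_key` is my name for the join key, not a function from the pipeline):

```python
def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    digits = []
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if ch in "hw":
            continue              # h/w do not break a run of the same code
        if code and code != prev:
            digits.append(code)
        prev = code
    return (name[0].upper() + "".join(digits) + "000")[:4]

def blocking_key(name: str) -> str:
    # SOUNDEX plus the first two letters, mirroring the self-join key
    return f"{soundex(name)}|{name[:2]}"
```

Note that real Soundex gives walgreens W426 but walmart W456, so those two would not actually share a block; the codes in the example table further down are illustrative only.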
The problem is the self-join (as far as I can tell, but I am happy to hear other suggestions or alternative ways to attack the problem).
I have tried increasing the partition multiplier to 100 and reducing spark.executor.cores to 2.
I have also tried joining on the SOUNDEX plus the first three letters.
Code (Spark + R):
dist.frac = 0.2
min.dist.float = 0.7
max.dist.float = 4
method = 'dl'
old.col = 'company_name'
pref = 'fuzzy'
substr.mn = 1
substr.mx = 2
weight = c(1, 1, 1, 1)
persist.type = 'MEMORY_AND_DISK'  # assumed; not defined in the original snippet
n.partitions = 19 * 15 * 13 * 3   # assumed: multiplier x nodes x executors/node x cores; not defined in the original snippet
new.col = paste(pref, old.col, sep='_')
spark_df =
  spark_df %>%
  rename(old_col = !!sym(old.col))

fuzzy_spark_df =
  spark_df %>%
  group_by(old_col) %>%
  summarize(freq = n()) %>%               # n() translates to COUNT(*) in Spark SQL
  mutate(character_count = length(old_col)) %>%
  mutate(max_dist = pmin(pmax(min.dist.float, dist.frac * character_count), max.dist.float)) %>%
  mutate(grpcol = paste(soundex(old_col), '|', substr(old_col, substr.mn, substr.mx))) %>%
  filter(character_count >= 2)

fuzzy_spark_df =
  fuzzy_spark_df %>%
  sdf_persist(persist.type)

fuzzy_spark_df =
  fuzzy_spark_df %>%
  sdf_repartition(partition_by = c('grpcol'), partitions = n.partitions)

collect(head(fuzzy_spark_df))             # to force computation

logNote('performing the cross join')

fuzzy_spark_df =
  fuzzy_spark_df %>%
  select(old_col, grpcol) %>%
  inner_join(fuzzy_spark_df, by = c('grpcol'), suffix = c('_orig', '_fuzzy'))

fuzzy_spark_df =
  fuzzy_spark_df %>%
  mutate(distance = levenshtein(old_col_orig, old_col_fuzzy))

fuzzy_spark_df =
  fuzzy_spark_df %>%
  filter(max_dist >= distance) %>%
  group_by(old_col_orig) %>%
  filter(freq == max(freq, na.rm = TRUE) & distance == min(distance, na.rm = TRUE)) %>%
  arrange(old_col_fuzzy) %>%
  filter(row_number() == 1) %>%
  select('old_col_orig', 'old_col_fuzzy') %>%
  distinct()

spark_df =
  spark_df %>%
  sdf_broadcast() %>%
  left_join(fuzzy_spark_df, by = c('old_col' = 'old_col_orig')) %>%
  mutate(old_col_fuzzy = ifelse(is.na(old_col_fuzzy), 'blank', old_col_fuzzy)) %>%
  rename(!!old.col := 'old_col', !!new.col := 'old_col_fuzzy')
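For clarity, the max_dist column caps the allowed edit distance at min(max(0.7, 0.2 × character_count), 4). A quick sketch of that formula in plain Python (parameter names follow the snippet above):

```python
def max_dist(name, dist_frac=0.2, min_dist_float=0.7, max_dist_float=4):
    # mirrors pmin(pmax(min.dist.float, dist.frac * character_count), max.dist.float)
    return min(max(min_dist_float, dist_frac * len(name)), max_dist_float)
```

One thing worth noting: with these parameters 'cisco' gets a limit of 1.0, while the plain Levenshtein distance between 'cisco' and 'cicso' is 2 (Spark's levenshtein does not count transpositions as one edit), so that pair in my example would actually be filtered out even though method = 'dl' suggests Damerau-Levenshtein was intended.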
Spark dataframe spark_df:
company_name join_col
walgreens W123 | wa
walgreen W123 | wa
walmart W123 | wa
cisco C654 | ci
cicso C654 | ci
carta C986 | ca
Output (inner join of spark_df with itself on join_col):
company_name join_col fuzzy_company_name
walgreens W123 | wa walgreen
walgreens W123 | wa walgreens
walgreens W123 | wa walmart
walgreen W123 | wa walgreen
walgreen W123 | wa walgreens
walgreen W123 | wa walmart
walmart W123 | wa walgreen
walmart W123 | wa walgreens
walmart W123 | wa walmart
cisco C654 | ci cisco
cisco C654 | ci cicso
cicso C654 | ci cicso
cicso C654 | ci cisco
carta C986 | ca carta
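To make the filtering step concrete, here is a plain-Python sketch of the distance filter and pick-the-most-popular logic on toy data (frequencies are made up, the tie-breaking on distance and alphabetical order is omitted, and the real pipeline only compares names within a block rather than scanning all names):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert / delete / substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete from a
                           cur[j - 1] + 1,               # insert into a
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def canonical(name, freqs):
    """Return the most frequent name within the per-name distance limit."""
    limit = min(max(0.7, 0.2 * len(name)), 4)            # same formula as max_dist
    candidates = [(freq, other) for other, freq in freqs.items()
                  if levenshtein(name, other) <= limit]
    return max(candidates)[1] if candidates else None

freqs = {"walgreens": 120, "walgreen": 40, "walmart": 300}  # toy frequencies
canonical("walgreen", freqs)   # -> 'walgreens' (distance 1, higher frequency)
```

With these toy frequencies, 'walgreen' resolves to 'walgreens', while 'walmart' is too far from both and resolves to itself, matching what the group_by / filter block above is meant to do.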