Question

I want to join two datasets: the first is 4.5 GB and the second is 5 MB.

Here is my query:

val data= rdd1.join(rdd2,regexp_replace($"rdd2.SUBSCRIBER_ID","^0*","") === regexp_replace($"rdd1.subscriberid","^0*", "" ) or
    ((substring($"rdd2.FIRST_NAME",0,3) === $"rdd1.firstName") and (substring($"rdd2.LAST_NAME",0,4) === $"rdd1.lastName") and (regexp_replace(substring($"rdd2.BIRTH_DATE",0,10),"-","") === $"rdd1.DOB")) or
    ((substring($"rdd2.FIRST_NAME",0,3) === $"rdd1.firstName") and (substring($"rdd2.LAST_NAME",0,4) === $"rdd1.lastName")  and ($"rdd2.GENDER" === $"rdd1.gender")) or 
    ((substring($"rdd2.FIRST_NAME",0,3) === $"rdd1.firstName") and (regexp_replace(substring($"rdd2.BIRTH_DATE",0,10),"-","") === $"rdd1.DOB") and ($"rdd2.GENDER" === $"rdd1.gender")) or 
    ((substring($"rdd2.LAST_NAME",0,4) === $"rdd1.lastName") and (regexp_replace(substring($"rdd2.BIRTH_DATE",0,10),"-","") === $"rdd1.DOB") and ($"rdd2.GENDER" === $"rdd1.gender")))

It runs as a Cartesian join. I tried broadcasting rdd2, but performance did not improve.

I am using these properties:

--num-executors 30 --driver-memory 12G --executor-memory 30G  --executor-cores 6

--conf spark.sql.shuffle.partitions=2001 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
--conf spark.cleaner.ttl=800 --conf spark.debug.maxToStringFields=1000

We have 300 vcores available.

How can I change the query to get better performance?

Thanks for your help.

Answer 1

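Try removing the shuffle partitions setting (spark.sql.shuffle.partitions) and reducing the executor cores to 2-4.

Please try it and let me know the result.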
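
As a rough illustration of that suggestion, the same settings can be written on the SparkSession builder. This is only a sketch: the object name and app name are hypothetical, the executor count and memory are copied from the question, 4 cores is one value from the suggested 2-4 range, and it assumes the job is launched through spark-submit on YARN (where spark.executor.instances corresponds to --num-executors). Driver memory is left to the launch command.

```scala
import org.apache.spark.sql.SparkSession

object JoinJobSession {
  def main(args: Array[String]): Unit = {
    // Assumption: started via spark-submit, which supplies the master URL.
    val spark = SparkSession.builder()
      .appName("join-job") // hypothetical name
      .config("spark.executor.instances", "30")
      .config("spark.executor.memory", "30g")
      .config("spark.executor.cores", "4") // reduced from 6
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // spark.sql.shuffle.partitions is deliberately not set here,
      // so Spark's built-in default is used unless another config overrides it.
      .getOrCreate()

    // Print the effective value to confirm nothing else is overriding it.
    println(spark.conf.get("spark.sql.shuffle.partitions"))

    spark.stop()
  }
}
```

Whether fewer cores per executor actually helps depends on the cluster, so it is worth comparing run times before and after the change.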

Answer 2

I removed the "OR" conditions, created multiple DataFrames (one per condition), and unioned them all:

val data= rdd1.join(rdd2,regexp_replace($"rdd2.SUBSCRIBER_ID","^0*","") === regexp_replace($"rdd1.subscriberid","^0*", "" )) 
val data1 = rdd1.join(rdd2,((substring($"rdd2.FIRST_NAME",0,3) === $"rdd1.firstName") and (substring($"rdd2.LAST_NAME",0,4) === $"rdd1.lastName") and (regexp_replace(substring($"rdd2.BIRTH_DATE",0,10),"-","") === $"rdd1.DOB")))
val data2 = rdd1.join(rdd2,((substring($"rdd2.FIRST_NAME",0,3) === $"rdd1.firstName") and (substring($"rdd2.LAST_NAME",0,4) === $"rdd1.lastName")  and ($"rdd2.GENDER" === $"rdd1.gender")))
val data3 =  rdd1.join(rdd2,((substring($"rdd2.FIRST_NAME",0,3) === $"rdd1.firstName") and (regexp_replace(substring($"rdd2.BIRTH_DATE",0,10),"-","") === $"rdd1.DOB") and ($"rdd2.GENDER" === $"rdd1.gender")))
val data4= rdd1.join(rdd2,((substring($"rdd2.LAST_NAME",0,4) === $"rdd1.lastName") and (regexp_replace(substring($"rdd2.BIRTH_DATE",0,10),"-","") === $"rdd1.DOB") and ($"rdd2.GENDER" === $"rdd1.gender")))

val finaldata = data union data1 union data2 union data3 union data4
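
The likely reason this helps is that each branch is now a plain conjunction of equalities, which Spark can usually plan as a hash join, whereas a condition made of OR-ed alternatives generally forces a nested-loop or Cartesian plan. Below is a minimal, self-contained sketch of the same idea on toy data. The object name, the helper `matchPairs`, the sample rows, and the local master are assumptions for illustration. Two things are added that the code above does not do: an explicit `broadcast` hint on the small side, and a `distinct()` after the union, because a pair that satisfies several rules would otherwise appear once per rule. The sketch projects only the two ID columns to keep the output small.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col, regexp_replace, substring}

object UnionOfEquiJoins {
  // One equi-join condition per matching rule from the question.
  def rules: Seq[Column] = {
    val subId  = regexp_replace(col("rdd2.SUBSCRIBER_ID"), "^0*", "") ===
                 regexp_replace(col("rdd1.subscriberid"), "^0*", "")
    val first3 = substring(col("rdd2.FIRST_NAME"), 0, 3) === col("rdd1.firstName")
    val last4  = substring(col("rdd2.LAST_NAME"), 0, 4) === col("rdd1.lastName")
    val dob    = regexp_replace(substring(col("rdd2.BIRTH_DATE"), 0, 10), "-", "") === col("rdd1.DOB")
    val gender = col("rdd2.GENDER") === col("rdd1.gender")
    Seq(subId,
        first3 && last4 && dob,
        first3 && last4 && gender,
        first3 && dob && gender,
        last4 && dob && gender)
  }

  // Hypothetical helper: join once per rule, union the branches, drop repeats.
  def matchPairs(big: DataFrame, small: DataFrame): DataFrame =
    rules
      .map { rule =>
        big.join(broadcast(small), rule) // hint: ship the small side to every executor
          .select(col("rdd1.subscriberid").as("left_id"),
                  col("rdd2.SUBSCRIBER_ID").as("right_id"))
      }
      .reduce(_ union _)
      .distinct() // a pair matching several rules shows up in several branches

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("union-of-equi-joins")
      .master("local[*]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the 4.5 GB and 5 MB datasets.
    val big = Seq(
      ("00123", "JOH", "SMIT", "19800101", "M"),
      ("00999", "ANN", "LEE", "19900505", "F")
    ).toDF("subscriberid", "firstName", "lastName", "DOB", "gender").as("rdd1")

    val small = Seq(
      ("123", "JOHN", "SMITH", "1980-01-01 00:00:00", "M"),
      ("555", "ANNA", "LEE", "1990-05-05 00:00:00", "F")
    ).toDF("SUBSCRIBER_ID", "FIRST_NAME", "LAST_NAME", "BIRTH_DATE", "GENDER").as("rdd2")

    val matched = matchPairs(big, small)
    matched.explain() // inspect which join strategy was chosen for each branch
    matched.show()

    spark.stop()
  }
}
```

Calling `explain()` on the result is a quick way to check whether each branch is planned as a broadcast hash join rather than a nested-loop join on your Spark version.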

Spark performance tuning for a Cartesian product
