Efficient way to join and groupBy in Spark while minimizing shuffle

Time: 2018-07-06 02:44:29

Tags: scala apache-spark join group-by shuffle

I have two large dataframes, each with around two million records.

import spark.implicits._ // needed outside spark-shell for toDF and the $"..." column syntax

val df1 = Seq(
 ("k1a","k2a", "g1x","g2x")
,("k1b","k2b", "g1x","g2x")
,("k1c","k2c", "g1x","g2y")
,("k1d","k2d", "g1y","g2y")
,("k1e","k2e", "g1y","g2y")
,("k1f","k2f", "g1z","g2y")
).toDF("key1", "key2", "grp1","grp2")

val df2 = Seq(
 ("k1a","k2a", "v4a")
,("k1b","k2b", "v4b")
,("k1c","k2c", "v4c")
,("k1d","k2d", "v4d")
,("k1e","k2e", "v4e")
,("k1f","k2f", "v4f")
).toDF("key1", "key2", "fld4")

I am trying to join them and run the groupBy shown below, but it takes a very long time to produce a result. There are roughly one million unique grp1 + grp2 combinations in df1.

import org.apache.spark.sql.functions.{collect_list, struct}

val df3 = df1.join(df2, Seq("key1","key2"))
val df4 = df3.groupBy("grp1","grp2")
  .agg(collect_list(struct($"key1",$"key2")).as("dups"))
  .filter("size(dups)>1")
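
One common first step, not something the post itself tries: since df2 has only about two million narrow rows, it may fit in executor memory, and a broadcast hash join then removes the join-side shuffle entirely. A minimal sketch, assuming df2 is small enough to broadcast:

import org.apache.spark.sql.functions.{broadcast, collect_list, struct}

// Ship df2 to every executor instead of shuffling both sides of the join.
val joined = df1.join(broadcast(df2), Seq("key1", "key2"))

// One shuffle remains: the groupBy on grp1/grp2.
val dups = joined.groupBy("grp1", "grp2")
  .agg(collect_list(struct($"key1", $"key2")).as("dups"))
  .filter("size(dups) > 1")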

Is there a way to reduce the shuffling? Would mapPartitions be suitable in either case? Can anyone suggest an efficient approach, with an example?
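
Another option worth noting, again an assumption on my part rather than anything in the post: if both inputs are persisted bucketed on the join keys, a sort-merge join over the bucketed tables avoids the exchange on both sides. A sketch, with hypothetical table names and an arbitrary bucket count:

// Persist both inputs bucketed (and sorted) on the join keys.
df1.write.bucketBy(200, "key1", "key2").sortBy("key1", "key2").saveAsTable("df1_bucketed")
df2.write.bucketBy(200, "key1", "key2").sortBy("key1", "key2").saveAsTable("df2_bucketed")

// The join on key1/key2 now needs no shuffle; only the later groupBy does.
val joined = spark.table("df1_bucketed").join(spark.table("df2_bucketed"), Seq("key1", "key2"))

The bucketed write itself shuffles once, so this mainly pays off when the same tables are joined repeatedly.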

0 Answers:

No answers