My question is:
Option 1 (pseudo-code):
large_source_df.cache()
small_source1_df.cache()
small_source2_df.cache()
small_source3_df.cache()
res1_df = large_source_df.join(broadcast(small_source1_df)).filter(...)
res2_df = large_source_df.join(broadcast(small_source2_df)).filter(...)
res3_df = large_source_df.join(broadcast(small_source3_df)).filter(...)
union_df = res1_df.union(res2_df).union(res3_df).count()
In this case, even though large_source_df has been cached, it is consumed three times, as if the Hive table were scanned three times.
Option 2 (pseudo-code): what if I change the code and add a repartition before caching?
large_source_df.repartition(200, $"userid").cache()
small_source1_df.cache()
small_source2_df.cache()
small_source3_df.cache()
res1_df = large_source_df.join(broadcast(small_source1_df)).filter(...)
res2_df = large_source_df.join(broadcast(small_source2_df)).filter(...)
res3_df = large_source_df.join(broadcast(small_source3_df)).filter(...)
union_df = res1_df.union(res2_df).union(res3_df).count()