I need to join two DataFrames with a left join.
df1 =
+----------+---------------+
|product_PK| rec_product_PK|
+----------+---------------+
|       560|            630|
|       710|            240|
|       610|            240|
df2 =
+----------+---------------+-----+
|product_PK| rec_product_PK| rank|
+----------+---------------+-----+
|       560|            610|    1|
|       560|            240|    1|
|       610|            240|    0|
The problem is that df1 contains only 500 rows, while df2 contains 600.000.000 rows spread over 24 partitions. My left join takes quite a while to execute; I waited 5 hours before it completed.
val result = df1.join(df2,Seq("product_PK","rec_product_PK"),"left")
The result should contain 500 rows. I run the code from spark-shell with the following parameters:
spark-shell --driver-memory 10G --driver-cores 4 --executor-memory 10G --num-executors 2 --executor-cores 4
How can I speed this up?
UPDATE
The output of df2.explain(true):
== Parsed Logical Plan ==
Repartition 5000, true
+- Project [product_PK#15L AS product_PK#195L, product_PK#189L AS reco_product_PK#196L, col2#190 AS rank#197]
+- Project [product_PK#15L, array_elem#184.product_PK AS product_PK#189L, array_elem#184.col2 AS col2#190]
+- Project [product_PK#15L, products#16, array_elem#184]
+- Generate explode(products#16), true, false, [array_elem#184]
+- Relation[product_PK#15L,products#16] parquet
== Analyzed Logical Plan ==
product_PK: bigint, rec_product_PK: bigint, rank: int
Repartition 5000, true
+- Project [product_PK#15L AS product_PK#195L, product_PK#189L AS reco_product_PK#196L, col2#190 AS rank_product_family#197]
+- Project [product_PK#15L, array_elem#184.product_PK AS product_PK#189L, array_elem#184.col2 AS col2#190]
+- Project [product_PK#15L, products#16, array_elem#184]
+- Generate explode(products#16), true, false, [array_elem#184]
+- Relation[product_PK#15L,products#16] parquet
== Optimized Logical Plan ==
Repartition 5000, true
+- Project [product_PK#15L, array_elem#184.product_PK AS rec_product_PK#196L, array_elem#184.col2 AS rank#197]
+- Generate explode(products#16), true, false, [array_elem#184]
+- Relation[product_PK#15L,products#16] parquet
== Physical Plan ==
Exchange RoundRobinPartitioning(5000)
+- *Project [product_PK#15L, array_elem#184.product_PK AS rec_PK#196L, array_elem#184.col2 AS rank#197]
+- Generate explode(products#16), true, false, [array_elem#184]
+- *FileScan parquet [product_PK#15L,products#16] Batched: false, Format: Parquet, Location: InMemoryFileIndex[s3://data/result/2017-11-27/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<product_PK:bigint,products:array<struct<product_PK:bigint,col2:int>>>
Answer 0 (score: 2)
You should use a different type of join. By default, the join you are doing assumes that both dataframes are large, so a lot of shuffling takes place (generally each row is hashed, the data is shuffled according to that hash, and then a per-executor join is performed). You can see this by calling explain on the result to inspect the execution plan.
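For instance, a minimal sketch (reusing df1, df2 and the join from the question) of how to check which strategy the planner picked:

// Assuming df1 and df2 are the DataFrames from the question.
val result = df1.join(df2, Seq("product_PK", "rec_product_PK"), "left")

// Print the physical plan: a SortMergeJoin or ShuffledHashJoin node means both
// sides get shuffled, while a BroadcastHashJoin node means the small side is
// copied to every executor instead.
result.explain()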
Consider using the broadcast hint instead:
val result = df2.join(broadcast(df1),Seq("product_PK","rec_product_PK"),"right")
Note that I flipped the join order so that the broadcast appears in the join argument. The broadcast function is part of org.apache.spark.sql.functions. This performs a broadcast join: df1 is copied to all executors and the join is done locally, avoiding the need to shuffle the large df2.
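A self-contained sketch of that suggestion, including the import the answer refers to (column names taken from the question):

import org.apache.spark.sql.functions.broadcast

// The broadcast hint ships the small df1 (~500 rows) to every executor, so the
// join runs locally against each partition of the large df2 and df2 itself is
// never shuffled. Calling explain on this result should now show a
// BroadcastHashJoin node instead of a shuffle-based join.
val result = df2.join(broadcast(df1), Seq("product_PK", "rec_product_PK"), "right")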
Answer 1 (score: 1)
Given that df1 is very small, it may be worth first collecting it into a list, using that list to filter the large df2 down to a comparably small dataframe, and then performing the left join between df1 and that filtered dataframe.
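A minimal sketch of that collect-and-filter approach, assuming the column names from the question (df1Keys and df2Filtered are illustrative names):

import org.apache.spark.sql.functions.col

// Collect the ~500 product_PK values of the small df1 onto the driver.
val df1Keys: Seq[Long] = df1.select("product_PK").collect().map(_.getLong(0)).toSeq

// Pre-filter the huge df2 down to only the product_PK values that occur in df1,
// so the subsequent join only has to touch a comparably small dataframe.
val df2Filtered = df2.filter(col("product_PK").isin(df1Keys: _*))

// The original left join, now against the much smaller df2Filtered.
val result = df1.join(df2Filtered, Seq("product_PK", "rec_product_PK"), "left")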