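I have a DataFrame that I need to join with another DataFrame. To reduce the amount of data, I want to select only the rows of the second DataFrame whose partition column matches the partition values present in the first. See the code: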
// get partition values (like 2017-01-01, 2017-01-02 etc)
val partitionValues = leftDataFrame.someFunctionHere()
rightDataFrame.createOrReplaceTempView("view")
// approximate syntax here
val rightDataFrameReduced = sparkSession
  .sql(s"select * from view where my_partition_col IN ($partitionValues)")
rightDataFrameReduced.createOrReplaceTempView("right_df")
leftDataFrame.createOrReplaceTempView("left_df")
// approximate syntax here
sparkSession.sql("select * from right_df join left_df ON right_df.id = left_df.id")
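To make the approximate IN clause above concrete, here is a minimal sketch of how the filter string could be built, assuming the partition values are already available as a Seq[String]. The names toInList and exampleValues are hypothetical and not part of Spark; how to obtain the values cheaply is still the open question.

```scala
// Hypothetical helper (not a Spark API): renders values as a quoted,
// comma-separated list suitable for a SQL IN (...) clause.
// Single quotes inside a value are doubled to keep the literal valid.
def toInList(values: Seq[String]): String =
  values.map(v => "'" + v.replace("'", "''") + "'").mkString(", ")

// Assumed example values, standing in for the real partition values.
val exampleValues = Seq("2017-01-01", "2017-01-02")

// Interpolating the Seq directly would produce "List(...)", so it is rendered first.
val query =
  s"select * from view where my_partition_col IN (${toInList(exampleValues)})"
// query: select * from view where my_partition_col IN ('2017-01-01', '2017-01-02')
```

The resulting string could then be passed to sparkSession.sql in place of the approximate query.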
So the question is: what should I use in place of leftDataFrame.someFunctionHere() to get the partition values while avoiding a full scan of the records?