PySpark join based on a CASE statement

Time: 2016-12-16 03:23:46

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I want to join two DataFrames based on a SQL CASE statement, as shown below. What is the best way to handle this situation?

FROM df1
LEFT JOIN df2 d
  ON d."Date1" <= CASE WHEN v."DATE2" >= v."DATE3" THEN df1."col1" ELSE df1."col2" END

1 Answer:

Answer 0 (score: 0)

Personally, I would put this into a UDF that returns a boolean. That way the business logic ends up in Python code and the SQL stays clean:

>>> from pyspark.sql.types import BooleanType

>>> def join_based_on_dates(left_date, date0, date1, col0, col1):
...     if date0 >= date1:
...         right_date = col0
...     else:
...         right_date = col1
...     return left_date <= right_date

>>> sqlContext.registerFunction("join_based_on_dates", join_based_on_dates, BooleanType())

>>> join_based_on_dates("2016-01-01", "2017-01-01", "2018-01-01", "res1", "res2")
True

>>> sqlContext.sql("SELECT join_based_on_dates('2016-01-01', '2017-01-01', '2018-01-01', 'res1', 'res2')").collect()
[Row(_c0=True)]
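
A side note: sqlContext.registerFunction is the Spark 1.x-era API. On Spark 2.x and later, the same registration goes through the SparkSession; a minimal sketch, assuming a spark session is available:

from pyspark.sql import SparkSession
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

# Same UDF as above, registered through the SparkSession.
spark.udf.register("join_based_on_dates", join_based_on_dates, BooleanType())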

Your query would then end up looking like this:

FROM df1
LEFT JOIN df2 ON join_based_on_dates('2016-01-01', '2017-01-01', '2018-01-01', 'res1', 'res2')
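
Wired up end to end with real column references in place of the literal placeholders, it might look like the sketch below; which frame each column lives on is an assumption carried over from the question:

# Hypothetical wiring: expose both frames to SQL, then join on the UDF
# with actual columns rather than literals (Spark 2.x temp-view API).
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

result = spark.sql("""
    SELECT *
    FROM df1
    LEFT JOIN df2
      ON join_based_on_dates(df2.Date1, df2.DATE2, df2.DATE3, df1.col1, df1.col2)
""")

One caveat worth knowing: a Python UDF in a join condition is opaque to the Catalyst optimizer, so Spark will typically fall back to a nested-loop style join here; the when/otherwise column expression shown under the question keeps the same logic visible to the optimizer.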

Hope this helps, and have fun with Spark!