How to dynamically compose join conditions in PySpark

Date: 2018-10-01 15:36:47

Tags: dynamic filter pyspark

I am trying to join two tables in PySpark, where one of the join conditions is determined dynamically by the contents of a column in the other table.

For example, Table 1 looks like

+-----+-----------+
|Acct |Util_Change|
+-----+-----------+
|1    |0.5        |         
+-----+-----------+
|2    |0.8        |
+-----+-----------+

Table 2 looks like

+----------+-----------+-----------+
|Low_Change|High_Change|CLS        |
+----------+-----------+-----------+
|>0        |0.3        |T1         | # This means Util_Change should be >0 and <=0.3
+----------+-----------+-----------+
|>0.3      |<0.7       |T2         | # This means Util_Change should be >0.3 and <0.7
+----------+-----------+-----------+
|0.7       |1          |T3         | # This means Util_Change should be >=0.7 and <=1
+----------+-----------+-----------+

I want to join Table 1 and Table 2 by matching table1.Util_Change against Low_Change and High_Change in Table 2. As you can see, the comparison operators are defined by Table 2 itself.

What is the best way to write this in PySpark?

Below is the code that creates the two tables:

product = [(1, 0.5), (2, 0.8)]
sp = sqlContext.createDataFrame(product, ["Acct", "Util_Change"])

grid = [('>0', '0.3', 'T1'), ('>0.3', '<0.7', 'T2'), ('0.7', '1', 'T3')]
sp2 = sqlContext.createDataFrame(grid, ["Low_Change", "High_Change", "CLS"])
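One possible approach (a sketch, not from the original post, and the helper names `bound_to_sql` and `row_predicate` are my own): since the grid table is tiny, collect it to the driver, parse each bound string into an operator and a threshold, and build a single SQL CASE expression that assigns CLS. Bare numbers are treated as inclusive bounds, which matches the comments in Table 2.

```python
import re

def bound_to_sql(col, bound, default_op):
    # Turn a bound string like '>0.3' or '0.7' into a SQL predicate.
    # A bare number falls back to the inclusive default operator.
    m = re.match(r'([<>]=?)?\s*([0-9.]+)', bound)
    op = m.group(1) or default_op
    return f"{col} {op} {m.group(2)}"

def row_predicate(low, high):
    # Lower bound defaults to >=, upper bound defaults to <=,
    # per the inclusive-unless-stated convention in Table 2.
    return (bound_to_sql("Util_Change", low, ">=")
            + " AND " + bound_to_sql("Util_Change", high, "<="))

# Build one predicate per grid row (pure Python, no Spark needed here).
grid = [('>0', '0.3', 'T1'), ('>0.3', '<0.7', 'T2'), ('0.7', '1', 'T3')]
preds = [(row_predicate(lo, hi), cls) for lo, hi, cls in grid]

# Chain the predicates into one CASE expression.
cls_expr = ("CASE "
            + " ".join(f"WHEN {p} THEN '{c}'" for p, c in preds)
            + " END")

# With a live SparkSession one would then apply it, e.g.:
#   from pyspark.sql.functions import expr
#   result = sp.withColumn("CLS", expr(cls_expr))
```

In a real job, `grid` would come from `sp2.collect()` rather than a literal list; collecting is fine here because the grid is small, and it avoids a non-equi cross join.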

0 Answers:

There are no answers yet.