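I am trying to join two tables in PySpark, where one of the join conditions is determined dynamically by the contents of a column in the other table.
For example, table 1 looks like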
+-----+-----------+
|Acct |Util_Change|
+-----+-----------+
|1    |0.5        |
|2    |0.8        |
+-----+-----------+
and table 2 looks like
+----------+-----------+-----------+
|Low_Change|High_Change|CLS        |
+----------+-----------+-----------+
|>0        |0.3        |T1         | # This means Util_Change should be > 0 and <= 0.3
|>0.3      |<0.7       |T2         | # This means Util_Change should be > 0.3 and < 0.7
|0.7       |1          |T3         | # This means Util_Change should be >= 0.7 and <= 1
+----------+-----------+-----------+
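I want to join table 1 and table 2 by matching table1.Util_Change against Low_Change and High_Change in table 2. As you can see, the comparison operators are defined by table 2: a bound with no operator prefix is inclusive, while a ">" or "<" prefix makes it strict.
What is the best way to code this in PySpark?
Here is the code to create the two tables: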
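# Assumes an existing sqlContext, e.g. the one provided by the pyspark shell.
product = [(1, 0.5), (2, 0.8)]
sp = sqlContext.createDataFrame(product, ["Acct", "Util_Change"])
# The bounds are kept as strings because some of them carry an operator prefix.
grid = [('>0', '0.3', 'T1'), ('>0.3', '<0.7', 'T2'), ('0.7', '1', 'T3')]
sp2 = sqlContext.createDataFrame(grid, ["Low_Change", "High_Change", "CLS"])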
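To make the matching rule concrete, here is a minimal plain-Python sketch of the semantics I am after. It is not PySpark and not a proposed solution, just a reference for the expected result. The helper names parse_bound and classify are hypothetical, and the handling of ">=" and "<=" prefixes is an assumption, since the grid above only uses ">" and "<". A bound with no prefix is treated as inclusive, as described in the comments next to table 2.

```python
import operator

# Hypothetical helper: split a bound such as ">0.3" or "0.7" into
# (comparison function, numeric value). default_op is used when the
# string carries no operator prefix.
def parse_bound(text, default_op):
    ops = {">=": operator.ge, "<=": operator.le,
           ">": operator.gt, "<": operator.lt}
    # Check the two-character operators first so ">=" is not read as ">".
    for symbol in (">=", "<=", ">", "<"):
        if text.startswith(symbol):
            return ops[symbol], float(text[len(symbol):])
    return default_op, float(text)

# Hypothetical helper: return the CLS of the first grid row whose
# bounds accept util_change, or None if no row matches.
def classify(util_change, grid):
    for low, high, cls in grid:
        low_op, low_val = parse_bound(low, operator.ge)     # bare low bound: >=
        high_op, high_val = parse_bound(high, operator.le)  # bare high bound: <=
        if low_op(util_change, low_val) and high_op(util_change, high_val):
            return cls
    return None

grid = [('>0', '0.3', 'T1'), ('>0.3', '<0.7', 'T2'), ('0.7', '1', 'T3')]
product = [(1, 0.5), (2, 0.8)]

# Emulate the join: attach the matching CLS to every account.
result = [(acct, util, classify(util, grid)) for acct, util in product]
print(result)
```

With the sample data, account 1 (0.5) should land in T2 and account 2 (0.8) in T3. The joined DataFrame I want from PySpark would carry the same Acct, Util_Change and CLS values.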