Suppose I have a PySpark DataFrame that I want to filter in some way. I can define a col object and pass it to the where or filter function. For example:
df.show()
Output:
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
| 5| A| 0|
| 10| B| 1|
| 5| A| 1|
| 10| A| 0|
| 5| B| 2|
+--------+--------+--------+
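For reference, a minimal sketch that reproduces this example DataFrame (it assumes a SparkSession is available; the names spark and df are just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rebuild the example data shown above.
df = spark.createDataFrame(
    [(5, 'A', 0), (10, 'B', 1), (5, 'A', 1), (10, 'A', 0), (5, 'B', 2)],
    ['feature1', 'feature2', 'feature3'],
)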
Then I can define a filter rule (a combination of col objects) like this:
import pyspark.sql.functions as F
F_cols = (F.col('feature1')==5) & (F.col('feature2')=='A') & (F.col('feature3')==1)
and then apply it to df:
df.filter(F_cols).show()
Output:
+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
| 5| A| 1|
+--------+--------+--------+
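As a side note, the same predicate can be passed to where, which in PySpark is an alias for filter:

df.where(F_cols).show()  # equivalent to df.filter(F_cols)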
I can then inspect F_cols:
F_cols
Output:
Column<b'(((feature1 = 5) AND (feature2 = A)) AND (feature3 = 1))'>
My question is: how can I modify F_cols without rewriting all of the rules (i.e. without redefining F_cols from scratch)? For example, suppose I only want to change the second rule.
So far I have tried defining a dictionary of rules and building my F_col from its key-value pairs:
from operator import and_
from functools import reduce
dict_rules = {'feature1':5, 'feature2':'A', 'feature3':1}
F_col = reduce(and_, (F.col(x) == y for x, y in dict_rules.items()))
F_col
Output:
Column<b'(((feature1 = 5) AND (feature2 = A)) AND (feature3 = 1))'>
I can then update the dictionary dict_rules and define a new F_col as follows:
dict_rules.update({'feature2':'B'})
F_col = reduce(and_, (F.col(x) == y for x, y in dict_rules.items()))
F_col
Output:
Column<b'(((feature1 = 5) AND (feature2 = B)) AND (feature3 = 1))'>
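To avoid repeating the reduce expression every time a rule changes, the construction could be wrapped in a small helper. This is only a sketch of that idea (build_filter is a hypothetical name, not part of any PySpark API):

from functools import reduce
from operator import and_
import pyspark.sql.functions as F

def build_filter(rules):
    # AND together one equality condition per (column, value) pair.
    return reduce(and_, (F.col(c) == v for c, v in rules.items()))

dict_rules['feature2'] = 'B'                # change only the second rule
df.filter(build_filter(dict_rules)).show()  # rebuild and apply the predicate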