PySpark DataFrame: manipulating a combination of functions.col objects

Asked: 2019-07-18 08:40:48

Tags: python pyspark pyspark-sql

Let's assume I have a PySpark DataFrame that I want to filter in some way. I can define a col object and pass it to the where/filter function. For example:

df.show()

Output:

+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|       5|       A|       0|
|      10|       B|       1|
|       5|       A|       1|
|      10|       A|       0|
|       5|       B|       2|
+--------+--------+--------+

I can then define the filter rules (a combination of col objects) like so:

import pyspark.sql.functions as F
F_cols = (F.col('feature1')==5) & (F.col('feature2')=='A') & (F.col('feature3')==1)

and then apply it to df. (As an aside, the parentheses around each comparison above are required: in Python, & binds more tightly than ==.)

df.filter(F_cols).show()

Output:

+--------+--------+--------+
|feature1|feature2|feature3|
+--------+--------+--------+
|       5|       A|       1|
+--------+--------+--------+
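
For reference, where is an alias of filter on PySpark DataFrames, so the following would give the same result:

df.where(F_cols).show()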

I can then inspect F_cols:

F_cols

Output:

Column<b'(((feature1 = 5) AND (feature2 = A)) AND (feature3 = 1))'>

My question is: how can I modify F_cols without rewriting all of the rules (that is, without redefining it from scratch)? For example, suppose I only want to change the second rule.

So far, I have tried defining a dictionary of rules and building my filter expression, F_col, from its key-value pairs:

from operator import and_
from functools import reduce

dict_rules = {'feature1': 5, 'feature2': 'A', 'feature3': 1}
# AND together one equality condition per (column, value) pair
F_col = reduce(and_, (F.col(x) == y for x, y in dict_rules.items()))
F_col

Output:

Column<b'(((feature1 = 5) AND (feature2 = A)) AND (feature3 = 1))'>
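
As a side note, this reduce pattern can be factored into a small helper so the whole expression is rebuilt from the rules in one call (a minimal sketch; build_filter is a hypothetical name, not part of the PySpark API):

# Hypothetical helper: rebuild the combined filter from a rules dict.
def build_filter(rules):
    # AND together one equality test per (column, value) pair
    return reduce(and_, (F.col(name) == value for name, value in rules.items()))

F_col = build_filter(dict_rules)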

I can then update the dict_rules dictionary and define a new F_col like this:

dict_rules.update({'feature2': 'B'})  # change only the second rule
F_col = reduce(and_, (F.col(x) == y for x, y in dict_rules.items()))
F_col

Output:

Column<b'(((feature1 = 5) AND (feature2 = B)) AND (feature3 = 1))'>
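
Applying the rebuilt filter works the same way as before. Note that with the sample data above, no row satisfies feature1 = 5 AND feature2 = B AND feature3 = 1, so this particular filter happens to return an empty result:

df.filter(F_col).show()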

0 Answers:

There are no answers yet.