Question

I have a Python list of PySpark columns, each of which holds a condition. I want a single column that combines all the conditions in that list.

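I tried using sum() to combine the columns, but that did not work (obviously). I have also been going through the documentation at https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html but nothing there seems to work for me.
I am doing something like this: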

my_condition_list = [col(c).isNotNull() for c in some_of_my_sdf_columns]

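This returns a list of separate PySpark columns. I need just one column that contains all the conditions combined with the | operator, so that I can use it in a .filter() or .when() clause.

Thanks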

Answer 1

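PySpark's where/filter does not accept a list of conditions. It accepts either a string or a single condition (a Column).

What you tried will not work as written; it needs some adjustment. Here are two ways to do it: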

data = [("ID1", 3, None), ("ID2", 4, 12), ("ID3", None, 3)]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])
df.show()

from pyspark.sql import functions as F

Method 1

# Change df_name below if your DataFrame variable has a different name
df_name = "df"
my_condition_list = ["%s['%s'].isNotNull()" % (df_name, c) for c in df.columns]

print(my_condition_list[0])
# df['ID'].isNotNull()

print(" & ".join(my_condition_list))
# df['ID'].isNotNull() & df['colA'].isNotNull() & df['colB'].isNotNull()

# eval turns the joined string into a single Column expression
print(eval(" & ".join(my_condition_list)))
# Column<b'(((ID IS NOT NULL) AND (colA IS NOT NULL)) AND (colB IS NOT NULL))'>

df.filter(eval(" & ".join(my_condition_list))).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID2|   4|  12|
+---+----+----+

df.filter(eval(" | ".join(my_condition_list))).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1|   3|null|
|ID2|   4|  12|
|ID3|null|   3|
+---+----+----+
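
As a side note to Method 1, the same combined Column can probably be built without going through strings and eval, because PySpark Column objects overload the | and & operators. The following is a minimal sketch, not part of the original answer: the helper names any_of, all_of and demo are made up for illustration, and demo assumes a local Spark installation.

```python
from functools import reduce
import operator


def any_of(conditions):
    """OR together a non-empty list of conditions using the | operator."""
    return reduce(operator.or_, conditions)


def all_of(conditions):
    """AND together a non-empty list of conditions using the & operator."""
    return reduce(operator.and_, conditions)


def demo():
    # Hypothetical demo; pyspark is imported here so the helpers above
    # stay usable without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    data = [("ID1", 3, None), ("ID2", 4, 12), ("ID3", None, 3)]
    df = spark.createDataFrame(data, ["ID", "colA", "colB"])

    # One Column per condition, exactly as in the question
    conditions = [col(c).isNotNull() for c in ["colA", "colB"]]

    df.filter(any_of(conditions)).show()  # colA OR colB is not null
    df.filter(all_of(conditions)).show()  # colA AND colB are not null
```

Calling demo() runs both filters. The combined Column returned by any_of or all_of should also be usable as the condition in a when() clause. Note that reduce raises a TypeError on an empty list, so the list of conditions must contain at least one element.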

Method 2

# Build one SQL-style condition string per column
my_condition_list = ["%s is not null" % c for c in df.columns]
print(my_condition_list[0])
# ID is not null

print(" and ".join(my_condition_list))
# ID is not null and colA is not null and colB is not null

df.filter(" and ".join(my_condition_list)).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID2|   4|  12|
+---+----+----+

df.filter(" or ".join(my_condition_list)).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1|   3|null|
|ID2|   4|  12|
|ID3|null|   3|
+---+----+----+
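
One caveat with Method 2: the plain "%s is not null" pattern assumes column names that are valid SQL identifiers. Names containing spaces or dots generally need backtick quoting in a Spark SQL expression string. The sketch below shows one way to build the string with quoting; not_null_expr is a hypothetical helper name, and column names that themselves contain backticks are not handled.

```python
def not_null_expr(columns, joiner="or"):
    """Build a filter string such as '`a` is not null or `b` is not null'."""
    if joiner not in ("and", "or"):
        raise ValueError("joiner must be 'and' or 'or'")
    # Wrap each name in backticks so names with spaces or dots still parse
    return f" {joiner} ".join(f"`{c}` is not null" for c in columns)


print(not_null_expr(["colA", "col B"]))
# `colA` is not null or `col B` is not null
```

The resulting string can be passed to df.filter() in the same way as the joined strings above.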

Method 2 is the preferred approach.
