Question

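I am trying to check whether any of several column values is 0 inside a when/otherwise condition. Our Spark DataFrame has columns named 1 through 11, and each of them needs to be checked. Currently my code looks like this: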

df3 =df3.withColumn('Status', when((col("1") ==0)|(col("2") ==0)|(col("3") ==0)| (col("4") ==0) |(col("5") ==0)|(col("6") ==0)|(col("7") ==0)| (col("8") ==0)|(col("9") ==0)|(col("10") ==0)| (col("11") ==0) ,'Incomplete').otherwise('Complete'))

How can I achieve this with a for loop instead of so many or conditions?

Answer 1

I propose a more Pythonic solution, using functools.reduce and operator.or_.

This way you do not need to define any function, evaluate a string expression, or use a Python lambda. Hope this helps.
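A minimal sketch of how functools.reduce and operator.or_ could be combined here, assuming the columns are named "1" through "11" as in the question. The helper names any_true and add_status are hypothetical, not part of PySpark.

```python
from functools import reduce
from operator import or_


def any_true(conditions):
    """OR together a non-empty sequence of conditions with the | operator."""
    return reduce(or_, conditions)


def add_status(df, columns):
    """Add a 'Status' column: 'Incomplete' if any listed column equals 0."""
    # Imported here so that any_true stays usable without a Spark installation.
    from pyspark.sql.functions import col, when

    has_zero = any_true([col(c) == 0 for c in columns])
    return df.withColumn(
        "Status", when(has_zero, "Incomplete").otherwise("Complete")
    )


# Usage with the question's DataFrame (assumed to exist as df3):
# df3 = add_status(df3, [str(i) for i in range(1, 12)])
```

Because Column objects overload |, reduce(or_, ...) produces the same expression as the hand-written chain of or conditions in the question.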

Answer 2

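There is a better solution: map over the underlying RDD and compute the status with a plain Python function.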

>>> df = spark.createDataFrame([(1,0,0,2),(1,1,1,1)],['c1','c2','c3','c4'])
>>> df.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|  1|  0|  0|  2|
|  1|  1|  1|  1|
+---+---+---+---+

def status(x):
  # x is a Row, which behaves like a tuple, so a membership test works directly
  return 'Incomplete' if 0 in x else 'Complete'

>>> df.rdd.map(lambda x:  (x.c1, x.c2, x.c3, x.c4,status(x))).toDF(['c1','c2','c3','c4','status']).show()
+---+---+---+---+----------+
| c1| c2| c3| c4|    status|
+---+---+---+---+----------+
|  1|  0|  0|  2|Incomplete|
|  1|  1|  1|  1|  Complete|
+---+---+---+---+----------+
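The map above spells out the four column names by hand, which does not scale to the eleven columns in the question. A sketch that generalizes it to any number of columns, assuming each row can be treated as a plain tuple (PySpark Row objects are tuple subclasses). The names append_status and add_status_via_rdd are hypothetical.

```python
def status(row):
    """Return 'Incomplete' if any value in the row is 0, else 'Complete'."""
    return "Incomplete" if 0 in row else "Complete"


def append_status(row):
    """Return the row's values as a tuple with the status appended."""
    return tuple(row) + (status(row),)


def add_status_via_rdd(df):
    """Apply append_status to every row and rebuild a DataFrame."""
    # df.columns is a list of the existing column names.
    return df.rdd.map(append_status).toDF(df.columns + ["status"])


# Usage with the example DataFrame above (assumed to exist as df):
# add_status_via_rdd(df).show()
```

Note that this checks every column of the row; if only some columns should be inspected, the row would need to be filtered before the membership test.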

Answer 3

You can use the code below to collect your conditions, join them into a single string, and then call eval.

Code

cond ='|'.join('(col("'+str(_)+'")==0)' for _ in range(1, 12))

cond = '('+cond+')'

print(cond)

#((col("1")==0)|(col("2")==0)|(col("3")==0)|(col("4")==0)|(col("5")==0)|(col("6")==0)|(col("7")==0)|(col("8")==0)|(col("9")==0)|(col("10")==0)|(col("11")==0))

df3 = df3.withColumn('Status', when(eval(cond),'Incomplete').otherwise('Complete'))
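The same condition can also be built with an ordinary for loop over Column expressions, which avoids constructing and evaluating a string. A sketch under the same assumptions as the question (columns named "1" through "11"); the helper names or_all and add_status_with_loop are hypothetical.

```python
def or_all(conditions):
    """Combine a non-empty iterable of conditions with | in a plain for loop."""
    iterator = iter(conditions)
    combined = next(iterator)  # start from the first condition
    for condition in iterator:
        combined = combined | condition
    return combined


def add_status_with_loop(df, columns):
    """Add a 'Status' column: 'Incomplete' if any listed column equals 0."""
    # Imported here so that or_all stays usable without a Spark installation.
    from pyspark.sql.functions import col, when

    has_zero = or_all(col(c) == 0 for c in columns)
    return df.withColumn(
        "Status", when(has_zero, "Incomplete").otherwise("Complete")
    )


# Usage with the question's DataFrame (assumed to exist as df3):
# df3 = add_status_with_loop(df3, [str(i) for i in range(1, 12)])
```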

How to use a for loop in a when condition using PySpark?
