PySpark: withColumn() with two conditions and three outcomes

Date: 2016-10-20 18:27:08

Tags: apache-spark hive pyspark apache-spark-sql hiveql

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:

df = df.withColumn('new_column', 
    IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.)

I am trying to do this in PySpark but I'm not sure of the syntax. Any pointers? I looked into expr() but couldn't get it to work.

Note that df is a pyspark.sql.dataframe.DataFrame.

3 Answers:

Answer 0 (score: 42)

There are a few efficient ways to implement this. Let's start with required imports:

from pyspark.sql.functions import col, expr, when

You can use the Hive IF function inside expr:

new_column_1 = expr(
    """IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
)
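
The same IF expression also runs as plain Spark SQL, which can be handy for checking it interactively. A minimal sketch, assuming the question's df and a sqlContext are in scope (the view name fruits is made up; on Spark 2.x use createOrReplaceTempView and spark.sql instead):

df.registerTempTable("fruits")
new_df = sqlContext.sql("""
    SELECT *,
           IF(fruit1 IS NULL OR fruit2 IS NULL, 3,
              IF(fruit1 = fruit2, 1, 0)) AS new_column_1
    FROM fruits
""")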

Alternatively, use when + otherwise:

new_column_2 = when(
    col("fruit1").isNull() | col("fruit2").isNull(), 3
).when(col("fruit1") == col("fruit2"), 1).otherwise(0)
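
The order of the two when clauses is not critical here: a comparison against NULL evaluates to NULL, which when treats as false, so rows with a missing fruit still fall through to the isNull branch. A sketch with the equality test first (the name new_column_2_alt is made up):

new_column_2_alt = when(col("fruit1") == col("fruit2"), 1).when(
    col("fruit1").isNull() | col("fruit2").isNull(), 3
).otherwise(0)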

Finally, you could use the following trick: comparing two columns yields NULL when either side is NULL, so casting the boolean result to int gives 1, 0, or NULL, and coalesce fills the NULL rows with 3:

from pyspark.sql.functions import coalesce, lit

new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))

With example data:

df = sc.parallelize([
    ("orange", "apple"), ("kiwi", None), (None, "banana"), 
    ("mango", "mango"), (None, None)
]).toDF(["fruit1", "fruit2"])
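
On Spark 2.x the same frame can be built without the RDD detour; a sketch assuming a SparkSession named spark is available:

df = spark.createDataFrame(
    [("orange", "apple"), ("kiwi", None), (None, "banana"),
     ("mango", "mango"), (None, None)],
    ["fruit1", "fruit2"]
)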

you can use this as follows:

(df
    .withColumn("new_column_1", new_column_1)
    .withColumn("new_column_2", new_column_2)
    .withColumn("new_column_3", new_column_3))

and the result is:

+------+------+------------+------------+------------+
|fruit1|fruit2|new_column_1|new_column_2|new_column_3|
+------+------+------------+------------+------------+
|orange| apple|           0|           0|           0|
|  kiwi|  null|           3|           3|           3|
|  null|banana|           3|           3|           3|
| mango| mango|           1|           1|           1|
|  null|  null|           3|           3|           3|
+------+------+------------+------------+------------+

Answer 1 (score: 11)

You'll want to use a udf, as below:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def func(fruit1, fruit2):
    if fruit1 is None or fruit2 is None:
        return 3
    if fruit1 == fruit2:
        return 1
    return 0

func_udf = udf(func, IntegerType())
df = df.withColumn('new_column',func_udf(df['fruit1'], df['fruit2']))
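
Note that a Python udf serializes every row between the JVM and Python, so the built-in expressions from the first answer are generally faster. If you still prefer a udf and are on Spark 2.1+, the decorator form is a slightly tidier sketch of the same logic (fruit_flag is a made-up name):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def fruit_flag(fruit1, fruit2):
    # 3 when either value is missing, 1 when they match, 0 otherwise
    if fruit1 is None or fruit2 is None:
        return 3
    return 1 if fruit1 == fruit2 else 0

df = df.withColumn('new_column', fruit_flag(df['fruit1'], df['fruit2']))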

Answer 2 (score: 0)

The withColumn function in PySpark lets you create a new column from a conditional expression: chaining when and otherwise gives you an if-then-else structure. You need to import the Spark SQL functions for this, as shown below; note that the code does not work without the col() function. First we declare the new column 'new_column' and give the first condition inside when (fruit1 == fruit2), returning 1 when it is true; otherwise control passes to the otherwise branch. There the second condition (fruit1 or fruit2 is NULL) is handled with isNull(), returning 3 when it is true and 0 otherwise.

from pyspark.sql import functions as F

df = df.withColumn('new_column',
    F.when(F.col('fruit1') == F.col('fruit2'), 1)
     .otherwise(F.when(F.col('fruit1').isNull() | F.col('fruit2').isNull(), 3)
                 .otherwise(0)))