Question

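I would like to know if there is a way to change two (or more) columns of a PySpark DataFrame at the same time. Right now I am using withColumn, but I don't know whether that means the condition will be checked twice (which would be too expensive for a large DataFrame). This code basically checks the values in two other columns (of the same row) and, based on them, changes two columns to None / null.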

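    from pyspark.sql.functions import col, when

    # is_special_id_udf and should_hide_response_udf are UDFs defined elsewhere
    condition = is_special_id_udf(col("id")) & should_hide_response_udf(col("response_created"))

    new_df = df.withColumn(
        "response_text",
        when(condition, None)
        .otherwise(col("response_text"))
    )

    # Chain on new_df; starting again from df would discard the previous change
    new_df = new_df.withColumn(
        "response_created",
        when(condition, None)
        .otherwise(col("response_created"))
    )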

Answer 1

First of all, you can simply add the UDF result as a new column, use it for the computations, and then drop it:

    from pyspark.sql.functions import col, when

    condition = is_special_id_udf(col("id")) & should_hide_response_udf(col("response_created"))

    # Store the condition in a temporary column, reuse it twice, then drop it
    new_df = df.withColumn("tmp", condition).withColumn(
        "response_text",
        when(col("tmp"), None)
        .otherwise(col("response_text"))
    ).withColumn(
        "response_created",
        when(col("tmp"), None)
        .otherwise(col("response_created"))
    ).drop("tmp")

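If you really want to generate both columns in one step, you can create a struct column and flatten it (adding whatever other columns you need to the select, of course):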

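    from pyspark.sql.functions import col, struct, when

    # A null struct flattens to null in both fields
    new_df = df.withColumn(
        "myStruct",
        when(condition, None)
        .otherwise(struct(col("response_text"), col("response_created")))
    ).select("myStruct.*")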

PySpark Dataframe: Changing two columns at the same time based on a condition
