Question

I have a PySpark DataFrame (input_dataframe) that looks like this:

**id**  **col1**  **col2**  **col3**  **col4** **col_check**
   101      1        0          1         1        -1
   102      0        1          1         0        -1
   103      1        1          0         1        -1
   104      0        0          1         1        -1

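I want a PySpark function (update_col_check) that updates the col_check column of this DataFrame. I will pass one column name to the function as an argument. The function should check whether the value in that column is 1 and, if so, set col_check to that column's name. Say I pass col2 as the argument: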

output_dataframe = update_col_check(input_dataframe, 'col2')

My output_dataframe should then look like this:

**id**  **col1**  **col2**  **col3**  **col4** **col_check**
   101      1        0          1         1        -1
   102      0        1          1         0        col2
   103      1        1          0         1        col2
   104      0        0          1         1        -1

Can I achieve this with PySpark? Any help would be appreciated.

Answer 1

You can do this fairly straightforwardly with the functions when and otherwise:

from pyspark.sql.functions import when, lit

def update_col_check(df, col_name):
    return df.withColumn('col_check', when(df[col_name] == 1, lit(col_name)).otherwise(df['col_check']))

update_col_check(df, 'col1').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101|   1|   0|   1|   1|     col1|
|102|   0|   1|   1|   0|       -1|
|103|   1|   1|   0|   1|     col1|
|104|   0|   0|   1|   1|       -1|
+---+----+----+----+----+---------+

update_col_check(df, 'col2').show()
+---+----+----+----+----+---------+
| id|col1|col2|col3|col4|col_check|
+---+----+----+----+----+---------+
|101|   1|   0|   1|   1|       -1|
|102|   0|   1|   1|   0|     col2|
|103|   1|   1|   0|   1|     col2|
|104|   0|   0|   1|   1|       -1|
+---+----+----+----+----+---------+
