Question

Can anyone answer the question linked below, but for pyspark?

how to fill a column with the value of another column based on a condition on some other columns?

I'll repeat the question here:

Suppose I have a dataframe in pyspark like this:

col1 | col2 | col3 | col4 
22   | null | 23   |  56
12   |  54  | 22   |  36
48   | null | 2    |  45
76   | 32   | 13   |  6
23   | null | 43   |  8
67   | 54   | 56   |  64
16   | 32   | 32   |  6
3    | 54   | 64   |  8
67   | 4    | 23   |  64

I want to replace the value of col4 with the value of col1 if col4 < col1 and col2 is not null.

So the result should be:

col1 | col2 | col3 | col4 
22   | null  | 23   |  56
12   |  54   | 22   |  36
48   | null  | 2    |  45
76   | 32    | 13   |  76
23   | null  | 43   |  8
67   | 54    | 56   |  67
16   | 32    | 32   |  16
3    | 54    | 64   |  8
67   | 4     | 23   |  67

Any help would be greatly appreciated.

Answer 1

This should solve your problem:

from pyspark.sql.functions import col, when

condition_col = (col('col4') < col('col1')) & (col('col2').isNotNull())
df = df.withColumn('col4', when(condition_col, col('col1')).otherwise(col('col4')))

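when(cond, result1).otherwise(result2) works like an if / else clause applied to columns.

For logical operators on columns, use & for and, | for or, and ~ for not. Because these operators bind more tightly than comparisons such as <, wrap each comparison in parentheses.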
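To check the rule by hand without a Spark session, the same if / else logic can be sketched row by row in plain Python. This is only an illustration under stated assumptions: fill_col4 is a hypothetical helper name (not part of pyspark), None stands in for null, and the rows are the sample data from the question.

```python
# Plain-Python sketch of the when(...).otherwise(...) rule; None plays the role of null.
rows = [
    (22, None, 23, 56), (12, 54, 22, 36), (48, None, 2, 45),
    (76, 32, 13, 6), (23, None, 43, 8), (67, 54, 56, 64),
    (16, 32, 32, 6), (3, 54, 64, 8), (67, 4, 23, 64),
]


def fill_col4(row):
    """Hypothetical helper: replace col4 with col1 when col4 < col1 and col2 is not null."""
    col1, col2, col3, col4 = row
    if col4 < col1 and col2 is not None:  # corresponds to the when(...) branch
        return (col1, col2, col3, col1)
    return row  # corresponds to the otherwise(...) branch


result = [fill_col4(r) for r in rows]
for r in result:
    print(r)
```

Rows where col2 is None keep their original col4 even when col4 < col1, which is the behaviour the isNotNull() part of the condition is there to enforce.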

Answer 2

from pyspark.sql.functions import when, col

# Recreate the sample data from the question (None becomes null)
values = [(22, None, 23, 56), (12, 54, 22, 36), (48, None, 2, 45),
          (76, 32, 13, 6), (23, None, 43, 8), (67, 54, 56, 64),
          (16, 32, 32, 6), (3, 54, 64, 8), (67, 4, 23, 64)]
# sqlContext is assumed to already exist (e.g. in the pyspark shell)
df = sqlContext.createDataFrame(values, ['col1', 'col2', 'col3', 'col4'])
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  22|null|  23|  56|
|  12|  54|  22|  36|
|  48|null|   2|  45|
|  76|  32|  13|   6|
|  23|null|  43|   8|
|  67|  54|  56|  64|
|  16|  32|  32|   6|
|   3|  54|  64|   8|
|  67|   4|  23|  64|
+----+----+----+----+

# Replace col4 with col1 where col4 < col1 and col2 is not null; otherwise keep col4
df = df.withColumn('col4', when((col('col4') < col('col1')) & col('col2').isNotNull(), col('col1')).otherwise(col('col4')))
df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  22|null|  23|  56|
|  12|  54|  22|  36|
|  48|null|   2|  45|
|  76|  32|  13|  76|
|  23|null|  43|   8|
|  67|  54|  56|  67|
|  16|  32|  32|  16|
|   3|  54|  64|   8|
|  67|   4|  23|  67|
+----+----+----+----+
