Conditionally cap a PySpark column based on another column?

Asked: 2019-04-22 07:46:04

Tags: pyspark

I have a PySpark dataframe:

x1 x2
12 4
8 5
13 2

I want to cap x1 at 10 for rows where x2 < 5, i.e.:

if x2 < 5:
  if x1 > 10:
    x1 = 10

How can I do this in PySpark?

Many thanks

1 Answer:

Answer 0 (score: 0):

Here is the basic logic, using when:

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.when

from pyspark.sql.functions import when

df = spark.createDataFrame([(12, 4), (8, 5), (13, 2)]).toDF("x1", "x2")

# 10 when x2 < 5; otherwise 10 only when x1 > 10, else null
df.withColumn(
    "logic",
    when(df.x2 < 5, 10).otherwise(when(df.x1 > 10, 10))
).show()

+---+---+-----+
| x1| x2|logic|
+---+---+-----+
| 12|  4|   10|
|  8|  5| null|
| 13|  2|   10|
+---+---+-----+

Alternative logic, which keeps the original x1 where the condition is not met instead of returning null:

from pyspark.sql.functions import when, lit

# Cap at 10 only when both x2 < 5 and x1 > 10; otherwise keep the original x1
df.withColumn(
    "logic",
    when((df.x2 < 5) & (df.x1 > 10), lit(10)).otherwise(df.x1)
).show()

+---+---+-----+
| x1| x2|logic|
+---+---+-----+
| 12|  4|   10|
|  8|  5|    8|
| 13|  2|   10|
+---+---+-----+
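
If you want the cap written back into x1 itself rather than into a new column, a minimal sketch along the same lines (assuming the same df as above, and using least as an equivalent way to express the cap) would be:

from pyspark.sql.functions import when, least, lit

# Overwrite x1 in place: for rows where x2 < 5, cap x1 at 10; leave other rows unchanged
df_capped = df.withColumn(
    "x1",
    when(df.x2 < 5, least(df.x1, lit(10))).otherwise(df.x1)
)
df_capped.show()

least(df.x1, lit(10)) gives the same result here as when((df.x2 < 5) & (df.x1 > 10), lit(10)).otherwise(df.x1); which form you prefer is a matter of readability.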