I have a PySpark DataFrame:
x1 x2
12 4
8 5
13 2
I want to cap x1 at 10 for the rows where x2 < 5, for example:
if x2 < 5:
    if x1 > 10:
        x1 = 10
How can I do this in PySpark?
Many thanks.
Answer 0 (score: 0)
Here is the basic logic:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.when
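from pyspark.sql.functions import when

# Assumes an active SparkSession named `spark`, as in the pyspark shell
df = spark.createDataFrame([(12, 4), (8, 5), (13, 2)]).toDF("x1", "x2")

# Rows matching neither condition get null, since the inner when() has no otherwise()
df.withColumn(
    "logic",
    when(df.x2 < 5, 10).otherwise(when(df.x1 > 10, 10)),
).show()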
+---+---+-----+
| x1| x2|logic|
+---+---+-----+
| 12|  4|   10|
|  8|  5| null|
| 13|  2|   10|
+---+---+-----+
The other logic, which sets the value to 10 only where x2 < 5 and x1 > 10, and keeps x1 otherwise:
from pyspark.sql.functions import when, lit

df.withColumn(
    "logic",
    when((df.x2 < 5) & (df.x1 > 10), lit(10)).otherwise(df.x1),
).show()
+---+---+-----+
| x1| x2|logic|
+---+---+-----+
| 12|  4|   10|
|  8|  5|    8|
| 13|  2|   10|
+---+---+-----+
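To see why the two expressions differ, it may help to mirror them in plain Python, with no Spark involved. This is only a rough sketch: the function names `basic_logic` and `capped_x1` are made up for illustration, and `None` stands in for Spark's null.

```python
# Plain-Python mirror of the two column expressions above (no Spark needed).
# The function names are hypothetical, chosen only for this illustration.
from typing import Optional

rows = [(12, 4), (8, 5), (13, 2)]  # (x1, x2) sample rows from the question


def basic_logic(x1: int, x2: int) -> Optional[int]:
    """Mirror of when(x2 < 5, 10).otherwise(when(x1 > 10, 10))."""
    if x2 < 5:
        return 10
    if x1 > 10:
        return 10
    return None  # no final otherwise(), so Spark would yield null here


def capped_x1(x1: int, x2: int) -> int:
    """Mirror of when((x2 < 5) & (x1 > 10), 10).otherwise(x1)."""
    if x2 < 5 and x1 > 10:
        return 10
    return x1


basic = [basic_logic(x1, x2) for x1, x2 in rows]
capped = [capped_x1(x1, x2) for x1, x2 in rows]
print(basic)   # [10, None, 10]
print(capped)  # [10, 8, 10]
```

On the sample rows the two only disagree for (8, 5). They would also disagree for a row such as (3, 2): the first expression returns 10 because x2 < 5 alone is enough, while the second keeps 3, which matches the nested if in the question.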