PySpark newbie here. I have a DataFrame:
+------+----+-----+
|    id|mode|count|
+------+----+-----+
|146360| DOS|   30|
|423541| UNO|    3|
+------+----+-----+
I want a DataFrame with a new column 'aggregate' that holds count * 2 when mode is 'DOS' and count * 1 when mode is 'UNO':
+------+----+-----+---------+
|    id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS|   30|       60|
|423541| UNO|    3|        3|
+------+----+-----+---------+
I'd appreciate your input, along with any pointers to best practices :)
Answer 0 (score: 1)
Method 1: use when from pyspark.sql.functions:
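from pyspark.sql.functions import when, col
# Rows with any other mode keep the original count. Pass col('count') here: a bare 'count' would be treated as a string literal.
df = df.withColumn('aggregate', when(col('mode') == 'DOS', col('count') * 2).when(col('mode') == 'UNO', col('count') * 1).otherwise(col('count')))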
Method 2: use a SQL CASE expression with selectExpr:
df = df.selectExpr("*","CASE WHEN mode == 'DOS' THEN count*2 WHEN mode == 'UNO' THEN count*1 ELSE count END AS aggregate")
Result:
+------+----+-----+---------+
|    id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS|   30|       60|
|423541| UNO|    3|        3|
+------+----+-----+---------+
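On the best-practices side: if more modes are likely to show up later, one option is to keep the multipliers in a dict and generate the CASE expression from it, so adding a mode is a one-line change. This is only a rough sketch. build_case_expr is a hypothetical helper, not part of PySpark, and it assumes the mode values are trusted strings, since they are interpolated straight into the SQL text.

```python
def build_case_expr(multipliers, key_col="mode", value_col="count", alias="aggregate"):
    """Build a SQL CASE expression string from a {mode: multiplier} mapping.

    Hypothetical helper for illustration; the keys are assumed to be trusted
    values because they are interpolated directly into the expression.
    """
    branches = " ".join(
        f"WHEN {key_col} = '{key}' THEN {value_col} * {factor}"
        for key, factor in multipliers.items()
    )
    # Rows whose mode is not in the mapping keep the original value.
    return f"CASE {branches} ELSE {value_col} END AS {alias}"


expr = build_case_expr({"DOS": 2, "UNO": 1})
print(expr)

# With the DataFrame from the question, the string would be passed to
# selectExpr in the same way as in Method 2:
# df = df.selectExpr("*", expr)
```

The generated string has the same shape as the hand-written expression in Method 2, so the result should match the table above.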