PySpark DataFrame: multiply a column based on the value in another column

Time: 2019-07-16 00:06:34

Tags: pyspark apache-spark-sql

PySpark newbie here. I have a DataFrame:

+------------+------+-----+
|          id|  mode|count|
+------------+------+-----+
|     146360 |   DOS|   30|
|     423541 |   UNO|    3|
+------------+------+-----+

I want a DataFrame with a new column 'aggregate' that holds count * 2 when mode is 'DOS' and count * 1 when mode is 'UNO':

+------------+------+-----+---------+
|          id|  mode|count|aggregate|
+------------+------+-----+---------+
|     146360 |   DOS|   30|       60|
|     423541 |   UNO|    3|        3|
+------------+------+-----+---------+

I'd appreciate your input, along with any pointers to best practices :)

1 answer:

Answer 0 (score: 1)

Method 1: Use when from pyspark.sql.functions:

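from pyspark.sql.functions import when, col
# otherwise() treats a plain string as a literal value, so pass col('count') to fall back to the column itself.
df = df.withColumn('aggregate', when(col('mode') == 'DOS', col('count') * 2).when(col('mode') == 'UNO', col('count') * 1).otherwise(col('count')))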

Method 2: Use a SQL CASE expression with selectExpr:

df = df.selectExpr("*","CASE WHEN mode == 'DOS' THEN count*2 WHEN mode == 'UNO' THEN count*1 ELSE count END AS aggregate")

Result:

+------+----+-----+---------+
|    id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS|   30|       60|
|423541| UNO|    3|        3|
+------+----+-----+---------+