How do I create a new column in pyspark based on a condition?

Time: 2018-05-28 14:18:49

Tags: python python-3.x apache-spark pyspark apache-spark-sql

I have the following Spark DataFrame:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
from pyspark.sql import functions as sf
from pyspark.sql.functions import col, when, lit

ddf = spark.createDataFrame([[None, 'Michael',2],
                             [30, 'Andy',3],
                             [19, 'Justin',4],
                             [30, 'James Dr No From Russia with Love Bond',6]],
                            schema=['age', 'name','weights'])
ddf.show()

In this simple example, I would like to create two columns: one with the weighted.mean of age if age > 29 (named weighted_age), and another with age^2 if age <= 29 (named age_squared).

1 Answer:

Answer 0 (score: 3)

You should first compute the weighted.mean of age over the whole dataset for the rows where age > 29, and then fill it in using withColumn. This is because the weighted.mean depends on the whole dataset.

age_squared, on the other hand, can be computed row by row:

from pyspark.sql import functions as f

# weighted mean of age over the rows with age > 29: sum(age * weights) / sum(weights)
weightedMean = ddf.filter(f.col('age') > 29)\
    .select(f.sum(f.col('age') * f.col('weights')) / f.sum(f.col('weights')))\
    .first()[0]

# weighted_age: the weighted mean where age > 29; age_squared: age^2 where age <= 29
ddf.withColumn('weighted_age', f.when(f.col('age') > 29, weightedMean))\
    .withColumn('age_squared', f.when(f.col('age') <= 29, f.col('age')*f.col('age')))\
    .show(truncate=False)

which should give you:

+----+--------------------------------------+-------+------------+-----------+
|age |name                                  |weights|weighted_age|age_squared|
+----+--------------------------------------+-------+------------+-----------+
|null|Michael                               |2      |null        |null       |
|30  |Andy                                  |3      |30.0        |null       |
|19  |Justin                                |4      |null        |361        |
|30  |James Dr No From Russia with Love Bond|6      |30.0        |null       |
+----+--------------------------------------+-------+------------+-----------+
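As a quick check against the data above: the weighted mean over the two rows with age > 29 is (30*3 + 30*6) / (3 + 6) = 270/9 = 30.0, and Justin's age_squared is 19^2 = 361, which matches the output.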

Instead of filling the default null, you can fill in other values by using the .otherwise function of when.
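For example, a minimal sketch continuing the snippet above, using 0 as the fallback value (an arbitrary choice here, purely for illustration):

from pyspark.sql import functions as f

# same columns as above, but rows that fail the condition get 0 instead of null
ddf.withColumn('weighted_age', f.when(f.col('age') > 29, weightedMean).otherwise(f.lit(0)))\
    .withColumn('age_squared', f.when(f.col('age') <= 29, f.col('age')*f.col('age')).otherwise(f.lit(0)))\
    .show(truncate=False)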