I have a data.frame in Spark:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
from pyspark.sql.functions import col, when, lit

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

ddf = spark.createDataFrame([[None, 'Michael', 2],
                             [30, 'Andy', 3],
                             [19, 'Justin', 4],
                             [30, 'James Dr No From Russia with Love Bond', 6]],
                            schema=['age', 'name', 'weights'])
ddf.show()
In this simple example, I want to create two columns: one holding the weighted mean of age where age > 29 (named weighted_age), and another holding age^2 where age <= 29 (named age_squared).
Answer 0 (score: 3)
You should first compute the weighted mean of age (over rows with age > 29) from the whole dataset, and then fill it in using withColumn. That is because the weighted mean depends on the entire dataset, whereas age_squared can be computed row by row.
from pyspark.sql import functions as f

weightedMean = ddf.filter(f.col('age') > 29) \
    .select(f.sum(f.col('age') * f.col('weights')) / f.sum(f.col('weights'))) \
    .first()[0]

ddf.withColumn('weighted_age', f.when(f.col('age') > 29, weightedMean)) \
   .withColumn('age_squared', f.when(f.col('age') <= 29, f.col('age') * f.col('age'))) \
   .show(truncate=False)
which should give you
+----+--------------------------------------+-------+------------+-----------+
|age |name |weights|weighted_age|age_squared|
+----+--------------------------------------+-------+------------+-----------+
|null|Michael |2 |null |null |
|30 |Andy |3 |30.0 |null |
|19 |Justin |4 |null |361 |
|30 |James Dr No From Russia with Love Bond|6 |30.0 |null |
+----+--------------------------------------+-------+------------+-----------+
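As a quick sanity check on the table above, the weighted mean and the squared age can be verified with plain Python (no Spark required); the numbers below come from the toy data in the question:

```python
# Rows with age > 29: (age, weight) pairs for Andy and the Bond row
over_29 = [(30, 3), (30, 6)]

# Weighted mean of age = sum(age * weight) / sum(weight)
weighted_mean = sum(a * w for a, w in over_29) / sum(w for _, w in over_29)
print(weighted_mean)  # 30.0, matching the weighted_age column

# Justin (age 19) falls in the age <= 29 branch
print(19 * 19)  # 361, matching the age_squared column
```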
Instead of filling in the default null, you can use the .otherwise clause of the when function to fill in another value.
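For intuition, when/otherwise acts as a per-row conditional with a fallback. A minimal plain-Python analogue (not Spark code, just an illustration of the fallback behaviour; the helper name is hypothetical) might look like:

```python
def when_otherwise(condition, value, otherwise=None):
    """Mimic pyspark's when(...).otherwise(...): return value when the
    condition holds, else the fallback (None plays the role of null)."""
    return value if condition else otherwise

ages = [None, 30, 19, 30]
weighted_mean = 30.0  # computed over the whole dataset, as above

# Default behaviour: rows failing the condition get None (null)
print([when_otherwise(a is not None and a > 29, weighted_mean) for a in ages])
# [None, 30.0, None, 30.0]

# With a fallback of -1.0: the same rows get -1.0 instead of null
print([when_otherwise(a is not None and a > 29, weighted_mean, -1.0) for a in ages])
# [-1.0, 30.0, -1.0, 30.0]
```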