My Spark DataFrame looks like this:
+------+------+-------+------+
|userid|useid1|userid2|score |
+------+------+-------+------+
|23 |null |dsad |3 |
|11 |44 |null |4 |
|231 |null |temp |5 |
|231 |null |temp |2 |
+------+------+-------+------+
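I want to run a calculation for each pair of userid and useid1 / userid2 (whichever of the two is not null).
If the value comes from useid1, multiply the score by 5; if it comes from userid2, multiply the score by 3.
Finally, I want to sum all the scores for each pair.
The result should be: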
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|11 |44 |20 |
|231 |temp |21 |
+------+--------+-----------+
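How can I do this?
For the grouping part, I know that DataFrame has a groupBy function, but I don't know whether it can be used conditionally: for example, groupBy(userid, userid2) when useid1 is null, and groupBy(userid, useid1) when userid2 is null.
For the calculation part, how do I multiply by 3 or 5 depending on the condition?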
Answer 0 (score: 1)
The following solution should solve your problem.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// The $"..." column syntax needs import spark.implicits._ (assuming a SparkSession named spark)
val groupByUserWinFun = Window.partitionBy("userid", "useid1/2")
// The column in the question is named useid1 (not userid1), so that name is used here
val finalScoreDF = userDF.withColumn("useid1/2", when($"useid1".isNull, $"userid2").otherwise($"useid1"))
  .withColumn("finalscore", when($"useid1".isNull, $"score" * 3).otherwise($"score" * 5))
  .withColumn("finalscore", sum("finalscore").over(groupByUserWinFun))
  .select("userid", "useid1/2", "finalscore").distinct()
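This uses the when function from Spark SQL to pick whichever of useid1 or userid2 is not null and to multiply the score by the matching factor.
Output: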
+------+--------+----------+
|userid|useid1/2|finalscore|
+------+--------+----------+
| 11 | 44| 20.0|
| 23 | dsad| 9.0|
| 231| temp| 21.0|
+------+--------+----------+
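Answer 1 (score: 0)
coalesce will do what you need.
df.withColumn("useid1/2", coalesce(col("useid1"), col("userid2")))
Basically, this function returns the first non-null value among its arguments, in order.
Documentation: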
COALESCE(T v1, T v2, ...)
Returns the first v that is not NULL, or NULL if all v's are NULL.
This requires import org.apache.spark.sql.functions.{coalesce, col}
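To make the first-non-null rule concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is only an illustration under stated assumptions: the ScoreRow case class and the PairScores object are hypothetical names, Option stands in for nullable columns, and orElse plays the role that coalesce plays on DataFrame columns.

```scala
object PairScores {
  // Hypothetical row type mirroring the question's columns; None models a null.
  final case class ScoreRow(userid: Int, useid1: Option[String], userid2: Option[String], score: Int)

  // Sample data copied from the question.
  val rows: Seq[ScoreRow] = Seq(
    ScoreRow(23, None, Some("dsad"), 3),
    ScoreRow(11, Some("44"), None, 4),
    ScoreRow(231, None, Some("temp"), 5),
    ScoreRow(231, None, Some("temp"), 2)
  )

  // Sum the weighted scores per (userid, partner id) pair.
  def totals(input: Seq[ScoreRow]): Map[(Int, String), Int] =
    input
      .flatMap { r =>
        // orElse mimics coalesce: take useid1 if present, otherwise userid2.
        // Rows where both are missing are dropped by the flatMap.
        r.useid1.orElse(r.userid2).map { partner =>
          // Weight 5 when the id came from useid1, 3 when it came from userid2.
          val weight = if (r.useid1.isDefined) 5 else 3
          ((r.userid, partner), r.score * weight)
        }
      }
      .groupBy(_._1)
      .map { case (key, scored) => key -> scored.map(_._2).sum }

  def main(args: Array[String]): Unit =
    totals(rows).foreach(println)
}
```

The totals match the result table in the question: 9 for (23, dsad), 20 for (11, 44), and 21 for (231, temp).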
Answer 2 (score: 0)
A groupBy will work:
val original = Seq(
(23, null, "dsad", 3),
(11, "44", null, 4),
(231, null, "temp", 5),
(231, null, "temp", 2)
).toDF("userid", "useid1", "userid2", "score")
// action
val result = original
.withColumn("useid1/2", coalesce($"useid1", $"userid2"))
.withColumn("score", $"score" * when($"useid1".isNotNull, 5).otherwise(3))
.groupBy("userid", "useid1/2")
.agg(sum("score").alias("final score"))
result.show(false)
Output:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|231 |temp |21 |
|11 |44 |20 |
+------+--------+-----------+