How to do map reduce on a Spark DataFrame, grouping by conditional columns?

Asked: 2019-04-02 03:53:09

Tags: scala apache-spark dataframe group-by mapreduce

My Spark DataFrame looks like this:

+------+------+-------+------+
|userid|useid1|userid2|score |
+------+------+-------+------+
|23    |null  |dsad   |3     |
|11    |44    |null   |4     |
|231   |null  |temp   |5     |
|231   |null  |temp   |2     |
+------+------+-------+------+

For each row I want to do a calculation on the pair of userid and useid1/userid2 (whichever of the two is not null).

If the non-null column is useid1, multiply the score by 5; if it is userid2, multiply the score by 3.

Finally, I want to sum up all the scores for each pair.

The result should be:

+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23    |dsad    |9          |
|11    |44      |20         |
|231   |temp    |21         |
+------+--------+-----------+

How can I do this?

For the groupBy part, I know DataFrame has a groupBy function, but I don't know whether it can be used conditionally, e.g. if useid1 is null, do groupBy(userid, userid2); if userid2 is null, do groupBy(userid, useid1).

For the calculation part, how do I multiply by 3 or 5 depending on the condition?

3 Answers:

Answer 0 (score: 1)

The following solution should solve your problem:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// userDF is the input DataFrame from the question
val groupByUserWinFun = Window.partitionBy("userid", "useid1/2")
val finalScoreDF = userDF
  .withColumn("useid1/2", when($"useid1".isNull, $"userid2").otherwise($"useid1"))
  .withColumn("finalscore", when($"useid1".isNull, $"score" * 3).otherwise($"score" * 5))
  .withColumn("finalscore", sum("finalscore").over(groupByUserWinFun))
  .select("userid", "useid1/2", "finalscore").distinct()

Use the when method from Spark SQL to select useid1 or userid2, and to multiply the score by the appropriate weight based on the null check.
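The per-row rule that the when/otherwise expression encodes can be sketched as a plain-Scala function (a hypothetical illustration outside Spark; the name weightedScore is made up):

```scala
// Mirrors when($"useid1".isNull, score * 3).otherwise(score * 5):
// a present useid1 is weighted by 5, otherwise userid2's row is weighted by 3.
def weightedScore(useid1: Option[String], score: Int): Int =
  if (useid1.isDefined) score * 5 else score * 3
```

For the sample rows, weightedScore(Some("44"), 4) gives 20 and weightedScore(None, 3) gives 9, matching the expected output.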

Output:

+------+--------+----------+
|userid|useid1/2|finalscore|
+------+--------+----------+
|   11 |      44|      20.0|
|   23 |    dsad|       9.0|
|   231|    temp|      21.0|
+------+--------+----------+

Answer 1 (score: 0)

coalesce will do what you need.

df.withColumn("useid1/2", coalesce(col("useid1"), col("userid2")))

Basically, this function returns the first non-null value, in argument order.

Documentation:

COALESCE(T v1, T v2, ...)

Returns the first v that is not NULL, or NULL if all v's are NULL.

Note that it needs the import: import org.apache.spark.sql.functions.coalesce
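The same first-non-null behaviour can be sketched in plain Scala with Option (an illustrative analogue, not Spark's implementation; coalesceFirst is a made-up name):

```scala
// Plain-Scala analogue of SQL COALESCE: return the first defined value, if any.
def coalesceFirst[T](vs: Option[T]*): Option[T] = vs.find(_.isDefined).flatten
```

For example, coalesceFirst(None, Some("dsad")) yields Some("dsad"), just as coalesce picks userid2 when useid1 is null.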

Answer 2 (score: 0)

A group by will work:

import org.apache.spark.sql.functions._ // coalesce, when, sum
import spark.implicits._                // toDF and the $-column syntax

val original = Seq(
  (23, null, "dsad", 3),
  (11, "44", null, 4),
  (231, null, "temp", 5),
  (231, null, "temp", 2)
).toDF("userid", "useid1", "userid2", "score")

// action
val result = original
  .withColumn("useid1/2", coalesce($"useid1", $"userid2"))
  .withColumn("score", $"score" * when($"useid1".isNotNull, 5).otherwise(3))
  .groupBy("userid", "useid1/2")
  .agg(sum("score").alias("final score"))

result.show(false)

Output:

+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23    |dsad    |9          |
|231   |temp    |21         |
|11    |44      |20         |
+------+--------+-----------+
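For intuition, the coalesce-then-weight-then-group-and-sum pipeline above can be mimicked on plain Scala collections (a sketch using the question's sample data, with null modeled as Option; not Spark code):

```scala
// Sample rows: (userid, useid1, userid2, score)
val rows = Seq(
  (23, None, Some("dsad"), 3),
  (11, Some("44"), None, 4),
  (231, None, Some("temp"), 5),
  (231, None, Some("temp"), 2)
)

// 1) coalesce the id columns, 2) weight the score, 3) group by key and sum.
val finalScores = rows
  .map { case (uid, u1, u2, s) =>
    val key = u1.orElse(u2).get                       // the "useid1/2" column
    val weighted = if (u1.isDefined) s * 5 else s * 3 // the when/otherwise rule
    (uid, key) -> weighted
  }
  .groupBy(_._1)
  .map { case ((uid, key), ws) => (uid, key, ws.map(_._2).sum) }
```

The result contains (23, "dsad", 9), (11, "44", 20), and (231, "temp", 21), matching the expected output of the Spark version.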