Taking items of equal value into account when determining rank

Time: 2018-11-27 17:07:53

Tags: scala apache-spark

In Spark, I want to count, for each value, how many values are less than or equal to it. I tried to achieve this with rank, but rank produces [1,2,2,2,3,4] -> [1,2,2,2,5,6], whereas what I want is [1,2,2,2,3,4] -> [1,4,4,4,5,6].

I can accomplish this by ranking, grouping by rank, and then adjusting each rank by the number of items in its group, but that is clunky and inefficient. Is there a better way?

Edit: added a minimal example of what I am trying to accomplish.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window


object Question extends App {
  val spark = SparkSession.builder.appName("Question").master("local[*]").getOrCreate()

  import spark.implicits._

  val win = Window.orderBy($"nums".asc)

  Seq(1, 2, 2, 2, 3, 4)
    .toDF("nums")
    .select($"nums", rank().over(win).alias("rank"))
    .as[(Int, Int)]
    .groupByKey(_._2)                                      // group rows that share the same rank
    .mapGroups((rank, nums) => (rank, nums.toList.map(_._1)))
    .map(x => (x._1 + x._2.length - 1, x._2))              // shift each rank up by (group size - 1)
    .flatMap(x => x._2.map(num => (num, x._1)))            // back to one row per value
    .toDF("nums", "rank")
    .show(false)
}

Output:

+----+----+
|nums|rank|
+----+----+
|1   |1   |
|2   |4   |
|2   |4   |
|2   |4   |
|3   |5   |
|4   |6   |
+----+----+

2 answers:

Answer 0 (score: 2)

Use window functions:

scala> val df =  Seq(1, 2, 2, 2, 3, 4).toDF("nums")
df: org.apache.spark.sql.DataFrame = [nums: int]

scala> df.createOrReplaceTempView("tbl")

scala> spark.sql(" with tab1(select nums, rank() over(order by nums) rk, count(*) over(partition by nums) cn from tbl) select nums, rk+cn-1 as rk2 from tab1 ").show(false)
18/11/28 02:20:55 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
|1   |1  |
|2   |4  |
|2   |4  |
|2   |4  |
|3   |5  |
|4   |6  |
+----+---+


scala>

Note that df is not partitioned on any column, so Spark complains about moving all the data to a single partition.
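That warning is unavoidable for a global rank, because a window with only an ORDER BY has to pull the whole dataset into one partition. As a minimal sketch of the alternative (assuming a hypothetical grouping column grp that does not exist in the question's data), partitioning the window removes the single-partition shuffle, at the cost of computing the adjusted rank within each group rather than globally:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, rank}

// Sketch only: "grp" is a hypothetical grouping column, not part of the original data.
val byGroup  = Window.partitionBy(col("grp")).orderBy(col("nums"))
val perValue = Window.partitionBy(col("grp"), col("nums"))
df.withColumn("rk2", rank().over(byGroup) + count(lit(1)).over(perValue) - 1)
  .show(false)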

EDIT1:

scala> spark.sql(" select nums, rank() over(order by nums) + count(*) over(partition by nums) -1 as rk2 from tbl ").show
18/11/28 23:20:09 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
|   1|  1|
|   2|  4|
|   2|  4|
|   2|  4|
|   3|  5|
|   4|  6|
+----+---+


scala>

EDIT2:

The equivalent DataFrame version:

scala> val df =  Seq(1, 2, 2, 2, 3, 4).toDF("nums")
df: org.apache.spark.sql.DataFrame = [nums: int]

scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._

scala> df.withColumn("rk2", rank().over(Window orderBy 'nums)+ count(lit(1)).over(Window.partitionBy('nums)) - 1 ).show(false)
2018-12-01 11:10:26 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
|1   |1  |
|2   |4  |
|2   |4  |
|2   |4  |
|3   |5  |
|4   |6  |
+----+---+


scala>

Answer 1 (score: 0)

So a friend pointed out that if I just compute the rank in descending order and then take (max_rank + 1) - current_rank for each rank, I get the result I want. Here is a more efficient implementation.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window


object Question extends App {
  val spark = SparkSession.builder.appName("Question").master("local[*]").getOrCreate()

  import spark.implicits._


  val win = Window.orderBy($"nums".desc)
  val rankings = Seq(1, 2, 2, 2, 3, 4)
    .toDF("nums")
    .select($"nums", rank().over(win).alias("rank"))
    .as[(Int, Int)]

  // the largest descending rank belongs to the smallest value
  val maxElement = rankings.select("rank").as[Int].reduce((a, b) => if (a > b) a else b)

  rankings
    .map(x => x.copy(_2 = maxElement - x._2 + 1)) // (max_rank + 1) - rank = count of values <= nums
    .toDF("nums", "rank")
    .orderBy("rank")
    .show(false)
}

Output:

+----+----+
|nums|rank|
+----+----+
|1   |1   |
|2   |4   |
|2   |4   |
|2   |4   |
|3   |5   |
|4   |6   |
+----+----+