How do I add a column with values based on percentile rank?

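Date: 2017-11-09 06:36:11

Tags: scala apache-spark apache-spark-sql

The table below lists employees and their salaries in no particular order. I am trying to get an output where the top 30% get the value "High", the next 40% get "Average", and the rest get "Low".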

Employee  Salary
Tony      50000
Alan      45000
Lee       60000
David     35000
Steve     65000
Paul      48000
Micky     62000
George    80000
Nigel     64000
John      42000

Output:

Employee   Salary   Percentage
Tony       50000    Average
Alan       45000    Low
Lee        60000    Average
David      35000    Low
Steve      65000    High
Paul       48000    Average
Micky      62000    Average
George     80000    High
Nigel      64000    High
John       42000    Low

Any help is much appreciated!

2 answers:

Answer 0: (score: 4)

You can implement it as follows:

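import org.apache.spark.sql.functions.{percent_rank, when}
import org.apache.spark.sql.expressions.Window
// The $"..." column syntax also needs spark.implicits._ (already imported in spark-shell).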

 dataDF.show
+--------+------+
|Employee|Salary|
+--------+------+
|    Tony| 50000|
|    Alan| 45000|
|     Lee| 60000|
|   David| 35000|
|   Steve| 65000|
|    Paul| 48000|
|   Micky| 62000|
|  George| 80000|
|   Nigel| 64000|
|    John| 42000|
+--------+------+

// A single window over the whole DataFrame, ordered by ascending salary.
val window = Window.partitionBy().orderBy(dataDF("Salary"))

dataDF
  .withColumn("rank", percent_rank().over(window))
  .withColumn("Percentage",
    when($"rank" > 0.7, "High")
      .when($"rank" <= 0.7 && $"rank" > 0.3, "Average")
      .otherwise("Low"))
  .drop("rank").show

+--------+------+----------+
|Employee|Salary|Percentage|
+--------+------+----------+
|   David| 35000|       Low|
|    John| 42000|       Low|
|    Alan| 45000|       Low|
|    Paul| 48000|   Average|
|    Tony| 50000|   Average|
|     Lee| 60000|   Average|
|   Micky| 62000|   Average|
|   Nigel| 64000|      High|
|   Steve| 65000|      High|
|  George| 80000|      High|
+--------+------+----------+

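Answer 1: (score: 4)

This can be done with the window function percent_rank. However, it requires the DataFrame to be ordered by the Salary column. percent_rank gives each row a fractional value between 0 and 1 that depends on the sort order; more specifically, the value is:

(rank of the row in its partition - 1) / (number of rows in the partition - 1)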
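To make the formula concrete, here is a small plain-Scala sketch (no Spark required) that applies it to the salaries from the question. The names `PercentRankSketch` and `percentRanks` are illustrative and not part of the Spark API, and the sketch assumes distinct salaries, so a row's rank is simply its 1-based position in ascending order.

```scala
object PercentRankSketch {
  // Salaries from the question, in their original order.
  val salaries: Seq[(String, Int)] = Seq(
    "Tony" -> 50000, "Alan" -> 45000, "Lee" -> 60000, "David" -> 35000,
    "Steve" -> 65000, "Paul" -> 48000, "Micky" -> 62000, "George" -> 80000,
    "Nigel" -> 64000, "John" -> 42000
  )

  // (rank - 1) / (rows - 1), where rank is the 1-based position after an
  // ascending sort, so (rank - 1) is the 0-based index.
  // Assumption: values are distinct and there are at least two rows.
  def percentRanks(rows: Seq[(String, Int)]): Seq[(String, Double)] = {
    val n = rows.size
    rows.sortBy(_._2).zipWithIndex.map { case ((name, _), idx) =>
      name -> idx.toDouble / (n - 1)
    }
  }

  def main(args: Array[String]): Unit =
    percentRanks(salaries).foreach { case (name, pr) =>
      println(f"$name%-7s $pr%.3f")
    }
}
```

With ten rows the values are multiples of 1/9: David gets 0, Paul 3/9 (about 0.333), Micky 6/9 (about 0.667) and Nigel 7/9 (about 0.778), which is why cut-offs at 0.3 and 0.7 split the rows 3/4/3.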

Assume the original DataFrame is df:

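import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{percent_rank, when}

// First store the percent rank, then overwrite it with the label.
val df2 = df.withColumn("Percentage", percent_rank().over(Window.orderBy("Salary")))
  .withColumn("Percentage", when($"Percentage" > 0.7, "High")
                            .when($"Percentage" < 0.3, "Low")
                            .otherwise("Average"))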

With the data from the question, the result would be:

+--------+------+----------+
|Employee|Salary|Percentage|
+--------+------+----------+
|   David| 35000|       Low|
|    John| 42000|       Low|
|    Alan| 45000|       Low|
|    Paul| 48000|   Average|
|    Tony| 50000|   Average|
|     Lee| 60000|   Average|
|   Micky| 62000|   Average|
|   Nigel| 64000|      High|
|   Steve| 65000|      High|
|  George| 80000|      High|
+--------+------+----------+
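The two answers differ only when a rank lands exactly on a cut-off: Answer 0 labels a rank of exactly 0.3 "Low", while Answer 1 labels it "Average". The plain-Scala sketch below mirrors the two `when` chains with hypothetical helpers `labelAnswer0` and `labelAnswer1` (not Spark API). It assumes that percent_rank yields the double 3.0 / 10 for the fourth-lowest of eleven distinct rows, which follows from the formula but is not verified against Spark here.

```scala
object ThresholdSketch {
  // Mirrors Answer 0: High if > 0.7, Average if in (0.3, 0.7], otherwise Low.
  def labelAnswer0(rank: Double): String =
    if (rank > 0.7) "High"
    else if (rank <= 0.7 && rank > 0.3) "Average"
    else "Low"

  // Mirrors Answer 1: High if > 0.7, Low if < 0.3, otherwise Average.
  def labelAnswer1(rank: Double): String =
    if (rank > 0.7) "High"
    else if (rank < 0.3) "Low"
    else "Average"

  def main(args: Array[String]): Unit = {
    // With 11 rows, the 4th-lowest row has rank (4 - 1) / (11 - 1).
    val boundary = 3.0 / 10
    println(s"Answer 0: ${labelAnswer0(boundary)}")
    println(s"Answer 1: ${labelAnswer1(boundary)}")
  }
}
```

For the ten rows in the question every rank is a multiple of 1/9, so none equals 0.3 or 0.7 and both answers produce the same table.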