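The table below lists employees and their salaries in no particular order. I am trying to get an output where the top 30% get the value "High", the next 40% get "Average", and the rest get "Low".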
Employee Salary
Tony 50000
Alan 45000
Lee 60000
David 35000
Steve 65000
Paul 48000
Micky 62000
George 80000
Nigel 64000
John 42000
Output:
Employee Salary Percentage
Tony 50000 Average
Alan 45000 Low
Lee 60000 Average
David 35000 Low
Steve 65000 High
Paul 48000 Average
Micky 62000 Average
George 80000 High
Nigel 64000 High
John 42000 Low
Any help is much appreciated!
Answer 0: (score: 4)
You can implement it as follows:
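import org.apache.spark.sql.functions.{percent_rank, when}
import org.apache.spark.sql.expressions.Window
// Outside spark-shell, also import spark.implicits._ (assuming a SparkSession named spark) for the $"..." syntax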
dataDF.show
+--------+------+
|Employee|Salary|
+--------+------+
|    Tony| 50000|
|    Alan| 45000|
|     Lee| 60000|
|   David| 35000|
|   Steve| 65000|
|    Paul| 48000|
|   Micky| 62000|
|  George| 80000|
|   Nigel| 64000|
|    John| 42000|
+--------+------+
// An empty partitionBy() puts every row in a single window, ordered by salary
val window = Window.partitionBy().orderBy(dataDF("Salary"))
dataDF
  .withColumn("rank", percent_rank().over(window))
  .withColumn("Percentage",
    when($"rank" > 0.7, "High")
      .when($"rank" <= 0.7 && $"rank" > 0.3, "Average")
      .otherwise("Low"))
  .drop("rank")
  .show
+--------+------+----------+
|Employee|Salary|Percentage|
+--------+------+----------+
|   David| 35000|       Low|
|    John| 42000|       Low|
|    Alan| 45000|       Low|
|    Paul| 48000|   Average|
|    Tony| 50000|   Average|
|     Lee| 60000|   Average|
|   Micky| 62000|   Average|
|   Nigel| 64000|      High|
|   Steve| 65000|      High|
|  George| 80000|      High|
+--------+------+----------+
Answer 1: (score: 4)
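This can be done with the Window function percent_rank. However, it requires the dataframe to be ordered by the Salary column. The percent_rank function gives each row a fractional value that depends on the sort order; more specifically, the value is:
(rank of the row in its partition - 1) / (number of rows in the partition - 1)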
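To make the formula concrete, here is a minimal sketch in plain Scala (no Spark) that applies it to the salaries from the question and then buckets the values with the 0.3 and 0.7 thresholds. The object and method names (PercentRankSketch, percentRanks, label) are hypothetical and chosen only for illustration, and ties are not handled because the sample data contains none.

```scala
object PercentRankSketch {
  // Salaries from the question, in their original (unsorted) order.
  val salaries: Seq[(String, Int)] = Seq(
    "Tony" -> 50000, "Alan" -> 45000, "Lee" -> 60000, "David" -> 35000,
    "Steve" -> 65000, "Paul" -> 48000, "Micky" -> 62000, "George" -> 80000,
    "Nigel" -> 64000, "John" -> 42000
  )

  // (rank - 1) / (rows - 1), where rank is the 1-based position after an
  // ascending sort. The 0-based index from zipWithIndex is already rank - 1.
  // Assumption: no duplicate salaries, so ties are not handled here.
  def percentRanks(rows: Seq[(String, Int)]): Map[String, Double] = {
    val n = rows.size
    rows.sortBy(_._2).zipWithIndex.map { case ((name, _), idx) =>
      // A single row gets 0.0 here to avoid dividing by zero.
      name -> (if (n > 1) idx.toDouble / (n - 1) else 0.0)
    }.toMap
  }

  // Bucket a percent rank using the 0.3 and 0.7 thresholds.
  def label(p: Double): String =
    if (p > 0.7) "High" else if (p < 0.3) "Low" else "Average"

  def main(args: Array[String]): Unit = {
    val ranks = percentRanks(salaries)
    // Print in the question's original order: name, salary, rank, label.
    salaries.foreach { case (name, salary) =>
      println(f"$name%-7s $salary%6d ${ranks(name)}%.3f ${label(ranks(name))}")
    }
  }
}
```

With ten rows the values step by 1/9, so the three lowest salaries fall below 0.3 and the three highest land above 0.7, which gives a 30/40/30 split for this data.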
Assuming the original dataframe is df:
// Compute the percent rank first, then overwrite the same column with its label
val df2 = df
  .withColumn("Percentage", percent_rank().over(Window.orderBy("Salary")))
  .withColumn("Percentage",
    when($"Percentage" > 0.7, "High")
      .when($"Percentage" < 0.3, "Low")
      .otherwise("Average"))
Using the data from the question, the result is:
+--------+------+----------+
|Employee|Salary|Percentage|
+--------+------+----------+
|   David| 35000|       Low|
|    John| 42000|       Low|
|    Alan| 45000|       Low|
|    Paul| 48000|   Average|
|    Tony| 50000|   Average|
|     Lee| 60000|   Average|
|   Micky| 62000|   Average|
|   Nigel| 64000|      High|
|   Steve| 65000|      High|
|  George| 80000|      High|
+--------+------+----------+