Question

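Suppose I have a PySpark DataFrame with two columns: id and salary. The DataFrame has 100 million records. I want to replace the salary column with a rank column, where the rank is the number of people with a lower salary. How can I do this efficiently?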

For example, given the following input DataFrame:

df = spark.createDataFrame([(1,2000),
                        (2,500),
                        (3,1500)],
                       ['id','salary'])

df.show()

+---+------+
| id|salary|
+---+------+
|  1|  2000|
|  2|   500|
|  3|  1500|
+---+------+

I would expect the following output:

results.show()

+---+----------+
| id|rank_order|
+---+----------+
|  1|         2|
|  2|         0|
|  3|         1|
+---+----------+

Answer 1

You can sort with a window and then add a row number, or alternatively convert the DataFrame to an RDD, sort it, and finally use zipWithIndex. Using a window:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

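# Order all rows by salary (no partitionBy, so the window spans the whole DataFrame)
window = Window.orderBy(F.col('salary'))

# dense_rank() starts at 1 and gives tied salaries the same rank
df = df.withColumn('salary', F.dense_rank().over(window))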

Answer 2

An efficient way is to use a window function, as shown below.
Order the window by salary and use all the rows before the current row.

from pyspark.sql import Window
import pyspark.sql.functions as F

# The frame covers all rows before the current one; the -1 keeps the current row out of the count
w = Window.orderBy('salary').rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)

# Count the rows in the frame, i.e. the rows that sort before the current salary
results = df.withColumn('rank_order', F.count('salary').over(w))
results.show()

+---+------+----------+
| id|salary|rank_order|
+---+------+----------+
|  2|   500|         0|
|  3|  1500|         1|
|  1|  2000|         2|
+---+------+----------+

Replace a column value with the number of other values in that column that are lower than it
