Suppose I have a Spark dataframe like this:
+------------+-----------+
|category |value |
+------------+-----------+
| a| 1|
| a| 2|
| b| 2|
| a| 3|
| b| 4|
| a| 4|
| b| 6|
| b| 8|
+------------+-----------+
I want to set the values above the 75th percentile (0.75 quantile) of each category to nan.
That is:
a_values = [1,2,3,4] => a_values_filtered = [1,2,3,nan]
b_values = [2,4,6,8] => b_values_filtered = [2,4,6,nan]
So the expected output is:
+------------+-----------+
|category |value |
+------------+-----------+
| a| 1|
| a| 2|
| b| 2|
| a| 3|
| b| 4|
| a| nan|
| b| 6|
| b| nan|
+------------+-----------+
Any ideas?
PS: I'm a newbie.
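For reference, a minimal sketch to build this sample DataFrame (assuming a local SparkSession; the names spark and df are just placeholders):

from pyspark.sql import SparkSession

# Build the sample data shown above
spark = SparkSession.builder.master('local[*]').getOrCreate()
df = spark.createDataFrame(
    [('a', 1), ('a', 2), ('b', 2), ('a', 3), ('b', 4), ('a', 4), ('b', 6), ('b', 8)],
    ['category', 'value']
)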
Answer 0 (score: 2)
Use the percent_rank function to get the percentile rank of each value, then use when to assign null to values whose percent_rank is greater than 0.75.
from pyspark.sql import Window
from pyspark.sql.functions import percent_rank, when

# Rank each value within its category as a fraction between 0 and 1
w = Window.partitionBy(df.category).orderBy(df.value)
percentiles_df = df.withColumn('percentile', percent_rank().over(w))

# Keep values with percent_rank <= 0.75; everything else becomes null
result = percentiles_df.select(
    percentiles_df.category,
    when(percentiles_df.percentile <= 0.75, percentiles_df.value).alias('value')
)
result.show()
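Note that when without an otherwise yields null rather than NaN. If you really need float NaN as in the question, a possible tweak (a sketch, not tested here) is to cast value to double and add an otherwise:

from pyspark.sql.functions import lit

# Variant: produce float NaN instead of null (the value column becomes a double)
result_nan = percentiles_df.select(
    percentiles_df.category,
    when(percentiles_df.percentile <= 0.75, percentiles_df.value.cast('double'))
        .otherwise(lit(float('nan'))).alias('value')
)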
Answer 1 (score: 1)
Here is another snippet, similar to Prabhala's answer, where I use the percentile_approx function instead.
from pyspark.sql import Window
import pyspark.sql.functions as F

# Compute the approximate 75th percentile of value within each category
window = Window.partitionBy('category')
percentile = F.expr('percentile_approx(value, 0.75)')
tmp_df = df.withColumn('percentile_value', percentile.over(window))

# Keep values at or below that percentile; everything else becomes null
result = tmp_df.select(
    'category',
    F.when(tmp_df.percentile_value >= tmp_df.value, tmp_df.value).alias('value')
)
result.show()
+--------+-----+
|category|value|
+--------+-----+
| b| 2|
| b| 4|
| b| 6|
| b| null|
| a| 1|
| a| 2|
| a| 3|
| a| null|
+--------+-----+
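As a sanity check (a small sketch, not part of the original answer), you can look at the per-category cutoff that percentile_approx computes; on this data it should be 3 for a and 6 for b, which is consistent with 4 and 8 being nulled out above:

# Inspect the per-category 75th-percentile cutoffs used above
df.groupBy('category') \
  .agg(F.expr('percentile_approx(value, 0.75)').alias('p75')) \
  .show()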