How can I filter outliers from a dataframe using percentiles?

Date: 2019-08-02 13:56:09

Tags: python pyspark outliers percentile

Suppose I have a Spark dataframe like this:

+------------+-----------+
|category    |value      |
+------------+-----------+
|           a|          1|
|           a|          2|
|           b|          2|
|           a|          3|
|           b|          4|
|           a|          4|
|           b|          6|
|           b|          8|
+------------+-----------+

I want to set the values above the 0.75 percentile to nan, for each category.

That is:

a_values = [1,2,3,4] => a_values_filtered = [1,2,3,nan]
b_values = [2,4,6,8] => b_values_filtered = [2,4,6,nan]

So the expected output would be:

+------------+-----------+
|category    |value      |
+------------+-----------+
|           a|          1|
|           a|          2|
|           b|          2|
|           a|          3|
|           b|          4|
|           a|        nan|
|           b|          6|
|           b|        nan|
+------------+-----------+

Any ideas?

PS: I'm new here

2 answers:

Answer 0 (score: 2):

Use the percent_rank function to get the percentile of each value, then use when to assign null to the values whose percent_rank is greater than 0.75.

from pyspark.sql import Window
from pyspark.sql.functions import percent_rank, when

# rank each value within its category, scaled to [0, 1]
w = Window.partitionBy(df.category).orderBy(df.value)
percentiles_df = df.withColumn('percentile', percent_rank().over(w))

# keep values at or below the 0.75 percent_rank; everything else becomes null
result = percentiles_df.select(percentiles_df.category,
                               when(percentiles_df.percentile <= 0.75, percentiles_df.value).alias('value'))
result.show()
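To make the percent_rank semantics concrete without a Spark session, here is a plain-Python sketch of what the window computation does per category; `percent_rank_filter` is a hypothetical helper written for illustration, not part of the answer above:

```python
from collections import defaultdict

def percent_rank_filter(rows, threshold=0.75):
    """Mimic percent_rank().over(Window.partitionBy(category).orderBy(value)):
    percent_rank = (rank - 1) / (n - 1), so the largest value in each
    category always gets 1.0 and is nulled out by the threshold."""
    by_cat = defaultdict(list)
    for cat, val in rows:
        by_cat[cat].append(val)
    out = {}
    for cat, vals in by_cat.items():
        ordered = sorted(vals)
        n = len(ordered)
        # index(v) reproduces rank-1 for distinct values (ties share a rank)
        out[cat] = [v if ordered.index(v) / (n - 1) <= threshold else None
                    for v in ordered]
    return out

rows = [('a', 1), ('a', 2), ('b', 2), ('a', 3),
        ('b', 4), ('a', 4), ('b', 6), ('b', 8)]
print(percent_rank_filter(rows))
# {'a': [1, 2, 3, None], 'b': [2, 4, 6, None]}
```

Note that percent_rank always assigns 1.0 to the top row of each partition, so with any threshold below 1 the per-category maximum is dropped even if it is not really an outlier.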

Answer 1 (score: 1):

Here is another snippet, similar to Prabhala's answer; it uses the built-in percentile_approx function instead.

from pyspark.sql import Window
import pyspark.sql.functions as F

# compute the approximate 75th percentile of value within each category
window = Window.partitionBy('category')
percentile = F.expr('percentile_approx(value, 0.75)')
tmp_df = df.withColumn('percentile_value', percentile.over(window))

# keep values at or below the per-category percentile; everything else becomes null
result = tmp_df.select('category',
                       F.when(tmp_df.percentile_value >= tmp_df.value, tmp_df.value).alias('value'))
result.show()

+--------+-----+
|category|value|
+--------+-----+
|       b|    2|
|       b|    4|
|       b|    6|
|       b| null|
|       a|    1|
|       a|    2|
|       a|    3|
|       a| null|
+--------+-----+
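The output above follows because percentile_approx returns an actual value from each group (6 for category b's [2,4,6,8]), and only values above that cutoff are nulled. A plain-Python sketch of that cutoff logic, with `approx_percentile_filter` as a hypothetical illustration helper:

```python
import math
from collections import defaultdict

def approx_percentile_filter(rows, p=0.75):
    """Sketch of percentile_approx(value, p) over a per-category window:
    the cutoff is a member of the group (the value at the p-th position,
    rounded up), and anything strictly above it becomes None."""
    by_cat = defaultdict(list)
    for cat, val in rows:
        by_cat[cat].append(val)
    out = {}
    for cat, vals in by_cat.items():
        ordered = sorted(vals)
        # e.g. for [2, 4, 6, 8] and p=0.75: ceil(0.75 * 4) - 1 = 2 -> cutoff 6
        cutoff = ordered[math.ceil(p * len(ordered)) - 1]
        out[cat] = [v if v <= cutoff else None for v in vals]
    return out

rows = [('a', 1), ('a', 2), ('b', 2), ('a', 3),
        ('b', 4), ('a', 4), ('b', 6), ('b', 8)]
print(approx_percentile_filter(rows))
# {'a': [1, 2, 3, None], 'b': [2, 4, 6, None]}
```

Unlike the percent_rank version, this keeps values equal to the cutoff, which is why the when condition in the answer uses >= rather than >.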