pyspark: aggregate the most common value in a column

Date: 2017-08-11 11:59:43

Tags: group-by pyspark aggregate

from pyspark.sql.functions import count, sum

aggregrated_table = df_input.groupBy('city', 'income_bracket') \
    .agg(
        count('suburb').alias('suburb'),
        sum('population').alias('population'),
        sum('gross_income').alias('gross_income'),
        sum('no_households').alias('no_households'))

I want to group by city and income bracket, but within each city some suburbs fall into different income brackets. How do I instead group by the income bracket that occurs most frequently in each city?

For example:

city1 suburb1 income_bracket_10 
city1 suburb1 income_bracket_10 
city1 suburb2 income_bracket_10 
city1 suburb3 income_bracket_11 
city1 suburb4 income_bracket_10 

the whole of city1 would be grouped under income_bracket_10.

2 Answers:

Answer 0 (score: 2)

Using a window function before the aggregation might do the trick. The count has to be taken per (city, income_bracket) pair, and rank() (rather than row_number()) keeps every row of the most frequent bracket, since those rows all tie on the count:

from pyspark.sql import Window
import pyspark.sql.functions as psf

# count how often each (city, income_bracket) pair occurs
w_bracket = Window.partitionBy('city', 'income_bracket')
# rank the brackets within each city, most frequent first
w_city = Window.partitionBy('city').orderBy(psf.desc('count'))

aggregrated_table = df_input.withColumn(
    "count",
    psf.count("*").over(w_bracket)
).withColumn(
    "rn",
    # rank() rather than row_number(): every row of the winning
    # bracket ties on 'count', so they all get rn = 1 and survive
    psf.rank().over(w_city)
).filter("rn = 1").groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households'))

You can also apply the window function after the aggregation, since the aggregated table already carries a count per (city, income_bracket).
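
For instance, a minimal sketch of that after-aggregation variant, reusing the question's own aggregation, where the suburb count doubles as the occurrence count:

from pyspark.sql import Window
import pyspark.sql.functions as psf

aggregrated_table = df_input.groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households')
).withColumn(
    "rn",
    # 'suburb' holds the number of rows per (city, income_bracket),
    # so ordering by it descending puts the most frequent bracket first
    psf.rank().over(Window.partitionBy('city').orderBy(psf.desc('suburb')))
).filter("rn = 1").drop("rn")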

Answer 1 (score: 1)

You don't necessarily need a window function:

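A minimal window-free sketch, assuming a groupBy count combined with the max-of-struct trick to pick each city's most frequent bracket (the intermediate names counts and modes are illustrative):

import pyspark.sql.functions as psf

# count the occurrences of each (city, income_bracket) pair
counts = df_input.groupBy('city', 'income_bracket').agg(
    psf.count('*').alias('n'))

# max over a struct compares field by field, so this picks the row
# with the highest count per city, i.e. the modal income bracket
modes = counts.groupBy('city').agg(
    psf.max(psf.struct('n', 'income_bracket')).alias('mode')
).select('city', psf.col('mode.income_bracket').alias('income_bracket'))

# relabel every row of the city with its modal bracket, then aggregate
aggregrated_table = df_input.drop('income_bracket').join(modes, 'city') \
    .groupBy('city', 'income_bracket').agg(
        psf.count('suburb').alias('suburb'),
        psf.sum('population').alias('population'),
        psf.sum('gross_income').alias('gross_income'),
        psf.sum('no_households').alias('no_households'))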

I think this does what you want, though I honestly don't know whether it performs better than the window-based solution.
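
For a quick sanity check, one could build the example rows as a toy DataFrame (the numeric columns here are made-up values; only the bracket logic matters) and run either snippet on it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_input = spark.createDataFrame(
    [('city1', 'suburb1', 'income_bracket_10', 100, 5000.0, 40),
     ('city1', 'suburb1', 'income_bracket_10', 120, 5500.0, 45),
     ('city1', 'suburb2', 'income_bracket_10', 130, 6000.0, 50),
     ('city1', 'suburb3', 'income_bracket_11', 90, 8000.0, 30),
     ('city1', 'suburb4', 'income_bracket_10', 110, 5200.0, 42)],
    ['city', 'suburb', 'income_bracket', 'population',
     'gross_income', 'no_households'])

# after re-running either snippet above, aggregrated_table.show()
# should report city1 under income_bracket_10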