from pyspark.sql.functions import count, sum  # note: `sum` shadows Python's built-in sum here

aggregrated_table = df_input.groupBy('city', 'income_bracket') \
    .agg(
        count('suburb').alias('suburb'),
        sum('population').alias('population'),
        sum('gross_income').alias('gross_income'),
        sum('no_households').alias('no_households'))
I want to group by city and income bracket, but within each city some suburbs fall into different income brackets. How can I group by the income bracket that occurs most often in each city?
For example:
city1 suburb1 income_bracket_10
city1 suburb1 income_bracket_10
city1 suburb2 income_bracket_10
city1 suburb3 income_bracket_11
city1 suburb4 income_bracket_10

would all be grouped under income_bracket_10.
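For reference, a toy df_input matching this example might look like the following; the city, suburb and income_bracket values come from the rows above, while the numeric columns are made-up values for illustration only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# city/suburb/income_bracket follow the example; the numbers are invented
df_input = spark.createDataFrame([
    ('city1', 'suburb1', 'income_bracket_10', 100, 1000.0, 50),
    ('city1', 'suburb1', 'income_bracket_10', 120, 1100.0, 60),
    ('city1', 'suburb2', 'income_bracket_10', 90, 900.0, 40),
    ('city1', 'suburb3', 'income_bracket_11', 80, 800.0, 30),
    ('city1', 'suburb4', 'income_bracket_10', 110, 950.0, 45),
], ['city', 'suburb', 'income_bracket', 'population', 'gross_income', 'no_households'])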
Answer 0 (score: 2)
Using a window function before aggregating might do the trick:
from pyspark.sql import Window
import pyspark.sql.functions as psf

# how often each income_bracket occurs within its city
w_bracket = Window.partitionBy('city', 'income_bracket')
# within each city, order rows so the most frequent bracket comes first
w_city = Window.partitionBy('city').orderBy(psf.desc('count'), 'income_bracket')

aggregrated_table = df_input.withColumn(
    "count",
    psf.count("*").over(w_bracket)
).withColumn(
    "income_bracket",
    # relabel every row with its city's most common bracket
    psf.first("income_bracket").over(w_city)
).groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households'))
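On the toy df_input above, this collapses city1 into a single row under income_bracket_10: suburb=5, population=500, gross_income=4750.0, no_households=225.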
You can also apply the window function after aggregating, since the count of (city, income_bracket) occurrences is still available at that point.
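A sketch of that variant with the same column names: aggregate per (city, income_bracket) first, then relabel the partial aggregates with each city's most frequent bracket and sum them up.

from pyspark.sql import Window
import pyspark.sql.functions as psf

# partial aggregates per (city, income_bracket), keeping the occurrence count
per_bracket = df_input.groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households'),
    psf.count('*').alias('count'))

w = Window.partitionBy('city').orderBy(psf.desc('count'), 'income_bracket')
aggregrated_table = per_bracket.withColumn(
    'income_bracket',
    psf.first('income_bracket').over(w)  # each city's most common bracket
).groupBy('city', 'income_bracket').agg(
    psf.sum('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households'))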
Answer 1 (score: 1)
You don't necessarily need a Window function:
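One way without a window, sketched with a plain groupBy plus a join (column names assumed from the question; the struct trick keeps the bracket with the highest count per city):

import pyspark.sql.functions as psf

# how often each income_bracket appears per city
bracket_counts = df_input.groupBy('city', 'income_bracket').agg(
    psf.count('*').alias('count'))

# for each city keep the bracket with the highest count
# (max over a struct compares by its first field, here the count)
top_bracket = bracket_counts.groupBy('city').agg(
    psf.max(psf.struct('count', 'income_bracket')).alias('top')
).select('city', psf.col('top.income_bracket').alias('income_bracket'))

# relabel every row with its city's most common bracket, then aggregate
aggregrated_table = df_input.drop('income_bracket') \
    .join(top_bracket, 'city') \
    .groupBy('city', 'income_bracket').agg(
        psf.count('suburb').alias('suburb'),
        psf.sum('population').alias('population'),
        psf.sum('gross_income').alias('gross_income'),
        psf.sum('no_households').alias('no_households'))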
I think this does what you want; I honestly don't know whether it performs better than the window-based solution.