How to select and order multiple columns in a PySpark DataFrame after a join

Date: 2016-11-07 14:21:39

Tags: python apache-spark pyspark apache-spark-sql

I want to select multiple columns from an existing DataFrame (created after a join) and order the fields to match the target table structure. How can this be done? The approach I am using is shown below. With it I can select the required columns, but I cannot get them into the required order.

Required (target table structure):
hist_columns = ("acct_nbr","account_sk_id", "zip_code","primary_state", "eff_start_date" ,"eff_end_date","eff_flag")

account_sk_df = hist_process_df.join(broadcast(df_sk_lkp), 'acct_nbr', 'inner')
account_sk_df_ld = account_sk_df.select([c for c in account_sk_df.columns if c in hist_columns])

>>> account_sk_df
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, hash_sk_id: string, account_sk_id: int]


>>> account_sk_df_ld
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, account_sk_id: int]

account_sk_id needs to be in the second position. What is the best way to do this?

1 Answer:

Answer 0 (score: 14):

Try selecting the columns by passing the list of names directly, instead of iterating over the existing columns; the ordering should then be fine:

account_sk_df_ld = account_sk_df.select(*hist_columns)
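
For illustration, here is a minimal, self-contained sketch (using made-up toy data and a local SparkSession; the DataFrame names and columns follow the question) showing that select(*cols) returns the columns in the order they are listed, regardless of their order in the joined DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the question's DataFrames (assumed sample data).
hist_process_df = spark.createDataFrame(
    [("A100", "NY", "10001", "2016-01-01", "9999-12-31", "Y")],
    ["acct_nbr", "primary_state", "zip_code",
     "eff_start_date", "eff_end_date", "eff_flag"])
df_sk_lkp = spark.createDataFrame(
    [("A100", "abc123", 1)],
    ["acct_nbr", "hash_sk_id", "account_sk_id"])

hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state",
                "eff_start_date", "eff_end_date", "eff_flag")

account_sk_df = hist_process_df.join(broadcast(df_sk_lkp), 'acct_nbr', 'inner')

# Unpacking the tuple gives select() the columns in the desired order.
account_sk_df_ld = account_sk_df.select(*hist_columns)
print(account_sk_df_ld.columns)
# ['acct_nbr', 'account_sk_id', 'zip_code', 'primary_state',
#  'eff_start_date', 'eff_end_date', 'eff_flag']

By contrast, the original list comprehension iterates over account_sk_df.columns, so it preserves the joined DataFrame's own column order, which is why account_sk_id ended up last.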