How can I efficiently prefix the column names of a DataFrame in PySpark without creating a new DataFrame?

Asked: 2017-06-15 09:12:58

Tags: python apache-spark pyspark spark-dataframe pyspark-sql

In pandas, you can rename all of a DataFrame's columns at once, "in place", like this:

new_column_name_list =['Pre_'+x for x in df.columns]
df.columns = new_column_name_list

Can we do the same in PySpark without ending up with a new DataFrame? Creating one seems inefficient, because we would then have two DataFrames holding the same data under different column names, which wastes memory.

The following link answers the question, but not in place:

How to change dataframe column names in pyspark?

EDIT: My question is clearly different from the one in the link above.

1 answer:

Answer 0 (score: 1)

This is how you could do it in Scala Spark: build the mapping from the old columns to the new column names dynamically, then select every column with an alias.

import org.apache.spark.sql.functions.col

// Existing columns as Column objects, and the new names to map them to
val oldCols = df2.columns.map(col(_))
val newNames = (1 to oldCols.length).map(i => s"column$i")

// Select every column under its new alias
df2.select(oldCols.zip(newNames).map { case (c, n) => c.alias(n) }: _*).show

Previous column names:

"age", "names"

After the change:

"column1", "column2"

However, a DataFrame cannot be updated in place, since DataFrames are immutable; the renamed result can instead be assigned to a variable for further use. Spark only materializes the DataFrames that are actually used, so this is not a memory issue.
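Back in PySpark, a minimal sketch for the original prefix question is to reuse the list comprehension from the question and rebind the same variable (df and the 'Pre_' prefix are taken from the question; toDF here is the column-renaming variant):

new_column_name_list = ['Pre_' + x for x in df.columns]

# toDF(*names) returns a DataFrame with the columns renamed; rebinding df drops
# the reference to the old one, and Spark only materializes data when an action
# runs, so no duplicate copy of the data is kept in memory.
df = df.toDF(*new_column_name_list)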

Hope this helps