PySpark - Rename more than one column using withColumnRenamed

Asked: 2016-08-05 22:30:53

Tags: apache-spark pyspark apache-spark-sql rename

I want to change the names of two columns using the spark withColumnRenamed function. Of course, I can write:

data = sqlContext.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
data = (data
       .withColumnRenamed('x1','x3')
       .withColumnRenamed('x2', 'x4'))

But I would like to do this in one step (having a list/tuple of the new names). Unfortunately, neither this:

data = data.withColumnRenamed(['x1', 'x2'], ['x3', 'x4'])

nor this:

data = data.withColumnRenamed(('x1', 'x2'), ('x3', 'x4'))

works. Is it possible to do this?

7 Answers:

Answer 0 (score: 37)

It is not possible to use a single withColumnRenamed call.

  • You can use the DataFrame.toDF method*

    data.toDF('x3', 'x4')
    

    new_names = ['x3', 'x4']
    data.toDF(*new_names)
    
  • It is also possible to rename with a simple select (columns not in the mapping keep their original names):

    from pyspark.sql.functions import col
    
    mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
    data.select([col(c).alias(mapping.get(c, c)) for c in data.columns])
    

Similarly, in Scala you can:

  • Rename all columns:

    val newNames = Seq("x3", "x4")
    
    data.toDF(newNames: _*)
    
  • Rename from a mapping with select:

    val mapping = Map("x1" -> "x3", "x2" -> "x4")
    
    df.select(
      df.columns.map(c => df(c).alias(mapping.get(c).getOrElse(c))): _*
    )
    

  • or foldLeft + withColumnRenamed:

    mapping.foldLeft(data){
      case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName) 
    }
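
The same foldLeft pattern can be expressed in PySpark with functools.reduce; a minimal sketch, assuming mapping is the same old-to-new dict as in the Python example above:

    from functools import reduce

    mapping = {'x1': 'x3', 'x2': 'x4'}
    # fold the mapping into the DataFrame, one rename at a time
    data = reduce(
        lambda df, kv: df.withColumnRenamed(kv[0], kv[1]),
        mapping.items(),
        data
    )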
    

* Not to be confused with RDD.toDF, which is not a variadic function and takes column names as a list instead.
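
For example (a small sketch contrasting the two; data is the DataFrame created above and data.rdd its underlying RDD):

    data.toDF('x3', 'x4')        # DataFrame.toDF: column names as variadic arguments
    data.rdd.toDF(['x3', 'x4'])  # RDD.toDF: column names as a list (or a schema)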

Answer 1 (score: 7)

I couldn't find an easy pyspark solution either, so I just built my own, similar to pandas' df.rename(columns={'old_name_1':'new_name_1', 'old_name_2':'new_name_2'}).

def rename_columns(df, columns):
    """Rename columns of a Spark DataFrame, given a dict of old -> new names."""
    if isinstance(columns, dict):
        for old_name, new_name in columns.items():
            df = df.withColumnRenamed(old_name, new_name)
        return df
    else:
        raise ValueError("'columns' should be a dict, like {'old_name_1':'new_name_1', 'old_name_2':'new_name_2'}")

So your call would look like data = rename_columns(data, {'x1': 'x3', 'x2': 'x4'})

It saves me some lines of code; hopefully it will help you too.

Answer 2 (score: 6)

This should work if you want to rename multiple columns by adding a prefix to their existing names:

df.select([f.col(c).alias(PREFIX + c) for c in columns])
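A self-contained sketch of the same idea (the prefix value and renaming all of df's columns are just illustrative assumptions):

    from pyspark.sql import functions as f

    PREFIX = 'my_'  # example prefix, replace with whatever you need
    df = df.select([f.col(c).alias(PREFIX + c) for c in df.columns])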

Answer 3 (score: 1)

All of my pyspark programs include this trick:

import pyspark
def rename_sdf(df, mapper={}, **kwargs_mapper):
    ''' Rename column names of a dataframe
        mapper: a dict mapping from the old column names to new names
        Usage:
            df.rename({'old_col_name': 'new_col_name', 'old_col_name2': 'new_col_name2'})
            df.rename(old_col_name=new_col_name)
    '''
    for before, after in mapper.items():
        df = df.withColumnRenamed(before, after)
    for before, after in kwargs_mapper.items():
        df = df.withColumnRenamed(before, after)
    return df
pyspark.sql.dataframe.DataFrame.rename = rename_sdf

Now you can easily rename any Spark DataFrame the pandas way!

df.rename({'old1':'new1', 'old2':'new2'})
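
Since the patched rename also accepts keyword arguments, the following works too (as long as the old column names are valid Python identifiers):

    df.rename(old1='new1', old2='new2')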

Answer 4 (score: 1)

The accepted answer by zero323 is efficient. Most of the other answers should be avoided.

Here's another efficient solution that leverages the quinn library and is well suited for production codebases:

import quinn

df = spark.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
def rename_col(s):
    mapping = {'x1': 'x3', 'x2': 'x4'}
    return mapping[s]
actual_df = df.transform(quinn.with_columns_renamed(rename_col))
actual_df.show()

Here's the resulting DataFrame:

+---+---+
| x3| x4|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

Let's look at the logical plans outputted by actual_df.explain(True) and verify that they're efficient:

== Parsed Logical Plan ==
'Project ['x1 AS x3#52, 'x2 AS x4#53]
+- LogicalRDD [x1#48L, x2#49L], false

== Analyzed Logical Plan ==
x3: bigint, x4: bigint
Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
+- LogicalRDD [x1#48L, x2#49L], false

== Optimized Logical Plan ==
Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
+- LogicalRDD [x1#48L, x2#49L], false

== Physical Plan ==
*(1) Project [x1#48L AS x3#52L, x2#49L AS x4#53L]

The parsed logical plan and the physical plan are essentially equal, so Catalyst isn't doing any heavy lifting to optimize the plan.

Multiple withColumnRenamed calls should be avoided because they create an inefficient parsed plan that needs to be optimized.

Let's look at an unnecessarily complicated parsed plan:

def rename_columns(df, columns):
    for old_name, new_name in columns.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df

def rename_col(s):
    mapping = {'x1': 'x3', 'x2': 'x4'}
    return mapping[s]
actual_df = rename_columns(df, {'x1': 'x3', 'x2': 'x4'})
actual_df.explain(True)

== Parsed Logical Plan ==
Project [x3#52L, x2#49L AS x4#55L]
+- Project [x1#48L AS x3#52L, x2#49L]
   +- LogicalRDD [x1#48L, x2#49L], false

== Analyzed Logical Plan ==
x3: bigint, x4: bigint
Project [x3#52L, x2#49L AS x4#55L]
+- Project [x1#48L AS x3#52L, x2#49L]
   +- LogicalRDD [x1#48L, x2#49L], false

== Optimized Logical Plan ==
Project [x1#48L AS x3#52L, x2#49L AS x4#55L]
+- LogicalRDD [x1#48L, x2#49L], false

== Physical Plan ==
*(1) Project [x1#48L AS x3#52L, x2#49L AS x4#55L]

Read this blog post for a detailed explanation of the different approaches to renaming PySpark columns.

Answer 5 (score: 0)

Why would you want to do it in a single line? If you print the execution plan, it is actually done in a single line anyway:

data = spark.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
data = (data
   .withColumnRenamed('x1','x3')
   .withColumnRenamed('x2', 'x4'))
data.explain()

Output:

== Physical Plan ==
*(1) Project [x1#1548L AS x3#1552L, x2#1549L AS x4#1555L]
+- Scan ExistingRDD[x1#1548L,x2#1549L]

If you want to use a list of tuples, you can use a simple map function:

from pyspark.sql import functions as F

data = spark.createDataFrame([(1,2), (3,4)], ['x1', 'x2'])
new_names = [("x1","x3"),("x2","x4")]
data = data.select(list(
       map(lambda old,new:F.col(old).alias(new),*zip(*new_names))
       ))

data.explain()

This still gives the same plan.

Output:

== Physical Plan ==
*(1) Project [x1#1650L AS x3#1654L, x2#1651L AS x4#1655L]
+- Scan ExistingRDD[x1#1650L,x2#1651L]

Answer 6 (score: 0)

The easiest way to do this is as follows:

Explanation:

  1. Get all the columns of the pyspark dataframe using df.columns
  2. Create a list, looping through each column from step 1
  3. The list will contain entries like col("col1").alias("col1_x"). Do this only for the required columns
  4. *[list] unpacks the list for the select statement in pyspark

from pyspark.sql import functions as F

(df
 .select(*[F.col(c).alias(f"{c}_x") for c in df.columns])
 .toPandas()
 .head()
)

Hope this helps.