I want to change the names of two columns using the Spark withColumnRenamed function. Of course, I can write:
data = sqlContext.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
data = (data
    .withColumnRenamed('x1', 'x3')
    .withColumnRenamed('x2', 'x4'))
but I would rather do it in one step (passing a list/tuple of the new names). Unfortunately, neither this:
data = data.withColumnRenamed(['x1', 'x2'], ['x3', 'x4'])
nor this:
data = data.withColumnRenamed(('x1', 'x2'), ('x3', 'x4'))
works. Is it possible to do this?
Answer 0 (score: 37)

It is not possible to do this with a single withColumnRenamed call.

You can use the DataFrame.toDF method*:
data.toDF('x3', 'x4')
or
new_names = ['x3', 'x4']
data.toDF(*new_names)
It is also possible to rename columns with a simple select:
from pyspark.sql.functions import col
mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))
data.select([col(c).alias(mapping.get(c, c)) for c in data.columns])
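A nice property of this approach is that any column not present in the mapping keeps its original name, thanks to the mapping.get(c, c) fallback. A small sketch illustrating this (the extra x5 column here is hypothetical):

from pyspark.sql.functions import col

data = sqlContext.createDataFrame([(1, 2, 5)], ['x1', 'x2', 'x5'])
mapping = dict(zip(['x1', 'x2'], ['x3', 'x4']))

# 'x5' is not in the mapping, so it keeps its name
renamed = data.select([col(c).alias(mapping.get(c, c)) for c in data.columns])
print(renamed.columns)  # ['x3', 'x4', 'x5']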
Similarly, in Scala you can:

Rename all columns:
val newNames = Seq("x3", "x4")
data.toDF(newNames: _*)
Rename with select:
val mapping = Map("x1" -> "x3", "x2" -> "x4")

df.select(
  df.columns.map(c => df(c).alias(mapping.get(c).getOrElse(c))): _*
)
or with foldLeft + withColumnRenamed:
mapping.foldLeft(data){
  case (data, (oldName, newName)) => data.withColumnRenamed(oldName, newName)
}
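For PySpark, the same fold pattern can be written with functools.reduce; a minimal sketch, reusing the data DataFrame from the question:

from functools import reduce

mapping = {'x1': 'x3', 'x2': 'x4'}

# Fold over the mapping, renaming one column per step
data = reduce(
    lambda df, kv: df.withColumnRenamed(kv[0], kv[1]),
    mapping.items(),
    data,
)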
* Not to be confused with RDD.toDF, which is not a variadic function and takes the column names as a list.
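A quick sketch of the difference, assuming the sc and sqlContext from the question's environment:

rdd = sc.parallelize([(1, 2), (3, 4)])

# RDD.toDF takes the column names as a single list...
df = rdd.toDF(['x1', 'x2'])

# ...while DataFrame.toDF is variadic and takes them as separate arguments
df = df.toDF('x3', 'x4')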
Answer 1 (score: 7)
I couldn't find an easy PySpark solution, so I just built my own, similar to pandas' df.rename(columns={'old_name_1':'new_name_1', 'old_name_2':'new_name_2'}).
def rename_columns(df, columns):
    if isinstance(columns, dict):
        for old_name, new_name in columns.items():
            df = df.withColumnRenamed(old_name, new_name)
        return df
    else:
        raise ValueError("'columns' should be a dict, like {'old_name_1':'new_name_1', 'old_name_2':'new_name_2'}")
So your solution will look like data = rename_columns(data, {'x1': 'x3', 'x2': 'x4'}).

It saves me some lines of code; hope it helps you too.
Answer 2 (score: 6)
If you want to rename multiple columns to the same names with a prefix added, this should work:
df.select([f.col(c).alias(PREFIX + c) for c in df.columns])
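For completeness, a self-contained sketch; PREFIX here is a hypothetical prefix string, and f is the usual alias for pyspark.sql.functions:

from pyspark.sql import functions as f

PREFIX = 'renamed_'  # hypothetical prefix

df = spark.createDataFrame([(1, 2)], ['x1', 'x2'])
prefixed = df.select([f.col(c).alias(PREFIX + c) for c in df.columns])
print(prefixed.columns)  # ['renamed_x1', 'renamed_x2']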
Answer 3 (score: 1)
All of my PySpark programs include this trick:
import pyspark

def rename_sdf(df, mapper={}, **kwargs_mapper):
    '''Rename the columns of a Spark DataFrame.

    mapper: a dict mapping old column names to new names

    Usage:
        df.rename({'old_col_name': 'new_col_name', 'old_col_name2': 'new_col_name2'})
        df.rename(old_col_name='new_col_name')
    '''
    for before, after in mapper.items():
        df = df.withColumnRenamed(before, after)
    for before, after in kwargs_mapper.items():
        df = df.withColumnRenamed(before, after)
    return df

# Attach the function as a method on the DataFrame class
pyspark.sql.dataframe.DataFrame.rename = rename_sdf
Now you can easily rename any Spark DataFrame the pandas way!
df.rename({'old1':'new1', 'old2':'new2'})
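A quick sketch of both calling styles this trick supports, assuming an active SparkSession named spark:

df = spark.createDataFrame([(1, 2)], ['old1', 'old2'])

# dict style, like pandas
renamed = df.rename({'old1': 'new1', 'old2': 'new2'})

# keyword-argument style
renamed = df.rename(old1='new1', old2='new2')
print(renamed.columns)  # ['new1', 'new2']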
Answer 4 (score: 1)
The accepted answer by zero323 is efficient. Most of the other answers should be avoided.

Here's another efficient solution that leverages the quinn library and is well suited for production codebases:
import quinn

df = spark.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])

def rename_col(s):
    mapping = {'x1': 'x3', 'x2': 'x4'}
    return mapping[s]

actual_df = df.transform(quinn.with_columns_renamed(rename_col))
actual_df.show()
Here's the DataFrame that is output:
+---+---+
| x3| x4|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
Let's look at the logical plans output by actual_df.explain(True) and verify that they're efficient:
== Parsed Logical Plan ==
'Project ['x1 AS x3#52, 'x2 AS x4#53]
+- LogicalRDD [x1#48L, x2#49L], false
== Analyzed Logical Plan ==
x3: bigint, x4: bigint
Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
+- LogicalRDD [x1#48L, x2#49L], false
== Optimized Logical Plan ==
Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
+- LogicalRDD [x1#48L, x2#49L], false
== Physical Plan ==
*(1) Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
The parsed logical plan and the physical plan are nearly equal, so Catalyst isn't doing any heavy lifting to optimize the plan.
Multiple calls to withColumnRenamed should be avoided, because they create an inefficient parsed plan that then needs to be optimized.
Let's look at an unnecessarily complex parsed plan:
def rename_columns(df, columns):
    for old_name, new_name in columns.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df

actual_df = rename_columns(df, {'x1': 'x3', 'x2': 'x4'})
actual_df.explain(True)
== Parsed Logical Plan ==
Project [x3#52L, x2#49L AS x4#55L]
+- Project [x1#48L AS x3#52L, x2#49L]
+- LogicalRDD [x1#48L, x2#49L], false
== Analyzed Logical Plan ==
x3: bigint, x4: bigint
Project [x3#52L, x2#49L AS x4#55L]
+- Project [x1#48L AS x3#52L, x2#49L]
+- LogicalRDD [x1#48L, x2#49L], false
== Optimized Logical Plan ==
Project [x1#48L AS x3#52L, x2#49L AS x4#55L]
+- LogicalRDD [x1#48L, x2#49L], false
== Physical Plan ==
*(1) Project [x1#48L AS x3#52L, x2#49L AS x4#55L]
Read this blog post for a detailed explanation of the different approaches to renaming PySpark columns.
Answer 5 (score: 0)
Why do it in a single call at all? If you print the execution plan, the rename is actually performed in a single projection anyway:
data = spark.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
data = (data
    .withColumnRenamed('x1', 'x3')
    .withColumnRenamed('x2', 'x4'))
data.explain()
Output:
== Physical Plan ==
*(1) Project [x1#1548L AS x3#1552L, x2#1549L AS x4#1555L]
+- Scan ExistingRDD[x1#1548L,x2#1549L]
If you want to use a list of tuples, you can use a simple map function:
from pyspark.sql import functions as F

data = spark.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])
new_names = [("x1", "x3"), ("x2", "x4")]

# zip(*new_names) unzips the pairs into ('x1', 'x2') and ('x3', 'x4')
data = data.select(list(
    map(lambda old, new: F.col(old).alias(new), *zip(*new_names))
))
data.explain()
It still has the same plan.

Output:
== Physical Plan ==
*(1) Project [x1#1650L AS x3#1654L, x2#1651L AS x4#1655L]
+- Scan ExistingRDD[x1#1650L,x2#1651L]
Answer 6 (score: 0)
The easiest way is the following:
from pyspark.sql import functions as F

(df
    .select(*[F.col(c).alias(f"{c}_x") for c in df.columns])
    .toPandas().head()
)
Hope this helps.