Dynamically rename multiple columns in a PySpark DataFrame

Date: 2017-01-14 21:24:44

Tags: apache-spark dataframe pyspark special-characters

I have a DataFrame with 15 columns in pyspark.

The column names are id, name, emp.dno, emp.sal, state, emp.city, zip, ...

Now I want to replace the column names that contain '.' with '_', e.g.

'emp.dno' to 'emp_dno'

I would like to do it dynamically.

How can I achieve this in pyspark?

4 Answers:

Answer 0 (score: 17)

You can use something similar to this great solution from @zero323:

df.toDF(*(c.replace('.', '_') for c in df.columns))

Alternatively:


from pyspark.sql.functions import col

replacements = {c: c.replace('.', '_') for c in df.columns if '.' in c}

df.select([col(c).alias(replacements.get(c, c)) for c in df.columns])

where the replacements dictionary looks like:

{'emp.city': 'emp_city', 'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'}

UPDATE:

If I have a DataFrame that also has spaces in the column names, how do I replace both the '.' and the spaces with '_'?
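A minimal sketch of one way to handle that follow-up, assuming you want to collapse both dots and runs of whitespace into underscores (the regex here is illustrative, not from the original answer):

import re

# Replace every run of '.' or whitespace in each column name with '_'
df.toDF(*(re.sub(r'[.\s]+', '_', c) for c in df.columns))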

Answer 1 (score: 1)

Here's an easy and fast function you can use. Enjoy! :)

def rename_cols(rename_df):
    # Loop over the columns, replacing '.' with '_' one rename at a time
    for column in rename_df.columns:
        new_column = column.replace('.', '_')
        rename_df = rename_df.withColumnRenamed(column, new_column)
    return rename_df
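For example, assuming a df with the dotted columns from the question, you would call it like this (the printed output is illustrative):

renamed_df = rename_cols(df)
print(renamed_df.columns)  # e.g. ['id', 'name', 'emp_dno', 'emp_sal', 'state', 'emp_city', 'zip']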

Answer 2 (score: 0)

The easiest way to do this is as follows:

Explanation:

  1. Use df.columns to fetch all the column names of the pyspark DataFrame.
  2. Build a list by looping through each column from step 1.
  3. Each element of the list is col("col.1").alias(c.replace('.', '_')). Do this only for the required columns; replace can substitute any pattern, and you can also exclude some columns from the rename.
  4. *[list] unpacks the list into the select statement in pyspark.

from pyspark.sql import functions as F

(df
 .select(*[F.col(c).alias(c.replace('.', '_')) for c in df.columns])
 .toPandas()
 .head())

Hope this helps.

Answer 3 (score: 0)

MaxU's answer is good and works. This post outlines another efficient approach that helps keep your codebase clean (it uses the quinn library).

Suppose you have the following DataFrame:

+---+-----+--------+-------+
| id| name|emp.city|emp.sal|
+---+-----+--------+-------+
| 12|  bob|New York|     80|
| 99|alice| Atlanta|     90|
+---+-----+--------+-------+
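If you want to follow along, here is a minimal way to build this sample DataFrame, assuming an existing SparkSession named spark (all columns created as strings, which matches the analyzed plan shown later):

df = spark.createDataFrame(
    [('12', 'bob', 'New York', '80'), ('99', 'alice', 'Atlanta', '90')],
    ['id', 'name', 'emp.city', 'emp.sal'])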

Here's how to replace the dots with underscores in all the column names:

import quinn

def dots_to_underscores(s):
    return s.replace('.', '_')

actual_df = df.transform(quinn.with_columns_renamed(dots_to_underscores))
actual_df.show()

Here's the resulting actual_df:

+---+-----+--------+-------+
| id| name|emp_city|emp_sal|
+---+-----+--------+-------+
| 12|  bob|New York|     80|
| 99|alice| Atlanta|     90|
+---+-----+--------+-------+

Let's run explain() to verify that this function executes efficiently:

actual_df.explain(True)

Here are the plans that are output:

== Parsed Logical Plan ==
'Project ['id AS id#50, 'name AS name#51, '`emp.city` AS emp_city#52, '`emp.sal` AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#29 AS id#50, name#30 AS name#51, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Optimized Logical Plan ==
Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Physical Plan ==
*(1) Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]

You can see that the parsed logical plan is almost identical to the physical plan, so the Catalyst optimizer doesn't have much optimization work to do. It converts id AS id#50 to id#29, but that isn't a lot of work.

The with_some_columns_renamed method generates an even more efficient parsed plan.

def dots_to_underscores(s):
    return s.replace('.', '_')

def change_col_name(s):
    return '.' in s

actual_df = df.transform(quinn.with_some_columns_renamed(dots_to_underscores, change_col_name))
actual_df.explain(True)

This parsed plan only aliases the columns that contain a dot:

== Parsed Logical Plan ==
'Project [unresolvedalias('id, None), unresolvedalias('name, None), '`emp.city` AS emp_city#42, '`emp.sal` AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Optimized Logical Plan ==
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Physical Plan ==
*(1) Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]

Read this blog post for more information on why looping over a DataFrame and calling withColumnRenamed multiple times creates overly complex parsed plans and should be avoided.
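To see that complexity yourself, you can print the plan produced by the loop-based rename_cols function from Answer 1 (a quick check, assuming the same sample df as above):

# Each withColumnRenamed call layers another projection onto the parsed plan,
# so the plan grows with the number of columns being renamed.
rename_cols(df).explain(True)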