I have a dataframe in pyspark with 15 columns.
The column names are id, name, emp.dno, emp.sal, state, emp.city, zip ...
Now I want to replace the '.' in the column names with '_', like 'emp.dno' to 'emp_dno'.
I want to do this dynamically. How can I achieve this in pyspark?
Answer 0 (score: 17)

You can use something similar to this great solution from @zero323:
df.toDF(*(c.replace('.', '_') for c in df.columns))

alternatively:
from pyspark.sql.functions import col

replacements = {c: c.replace('.', '_') for c in df.columns if '.' in c}

# Backtick-quote each name so a dotted column name is treated as a literal
# column name rather than as a struct field access:
df.select([col(f"`{c}`").alias(replacements.get(c, c)) for c in df.columns])
where the replacements dictionary looks like:

{'emp.city': 'emp_city', 'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'}

Update: if the dataframe also has spaces in the column names, how can both the spaces and the '.' be replaced?
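One way to handle that follow-up is to normalize every run of dots and whitespace in a single pass. A minimal sketch, assuming an existing SparkSession named spark (the regex and the sample data are illustrative, not from the original answer):

import re

# Hypothetical df with one dotted and one spaced column name:
df = spark.createDataFrame([(1, "bob", "New York")], ["id", "emp name", "emp.city"])

# Collapse every run of '.' and/or whitespace in each name into a single '_'.
df_clean = df.toDF(*(re.sub(r"[.\s]+", "_", c) for c in df.columns))
print(df_clean.columns)  # ['id', 'emp_name', 'emp_city']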
Answer 1 (score: 1)

Here is a quick and handy function for you to use. Enjoy! :)
def rename_cols(rename_df):
    # Rename each column in turn, replacing '.' with '_' in its name.
    # withColumnRenamed matches the old name literally, so dotted names
    # need no backtick-quoting here.
    for column in rename_df.columns:
        new_column = column.replace('.', '_')
        rename_df = rename_df.withColumnRenamed(column, new_column)
    return rename_df
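Usage is then a single call (df being the questioner's dataframe):

renamed_df = rename_cols(df)
print(renamed_df.columns)  # e.g. ['id', 'name', 'emp_dno', 'emp_sal', ...]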
Answer 2 (score: 0)

The simplest way is as follows:
from pyspark.sql import functions as F

(df
 # Backticks keep dotted names from being parsed as struct field access:
 .select(*[F.col(f"`{c}`").alias(c.replace('.', '_')) for c in df.columns])
 .toPandas().head()
)
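Note that toPandas().head() is only there to display the result; to keep working with a Spark DataFrame, assign the select itself (a trivial variation on the answer above):

renamed_df = df.select(*[F.col(f"`{c}`").alias(c.replace('.', '_')) for c in df.columns])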
Hope this helps.
Answer 3 (score: 0)

MaxU's answer is good and works. This post outlines another efficient approach that helps keep your codebase clean (it uses the quinn library).

Suppose you have the following DataFrame:
+---+-----+--------+-------+
| id| name|emp.city|emp.sal|
+---+-----+--------+-------+
| 12| bob|New York| 80|
| 99|alice| Atlanta| 90|
+---+-----+--------+-------+
Here's how to replace the dots with underscores in all of the column names.
import quinn
def dots_to_underscores(s):
return s.replace('.', '_')
actual_df = df.transform(quinn.with_columns_renamed(dots_to_underscores))
actual_df.show()
Here's the resulting actual_df:
+---+-----+--------+-------+
| id| name|emp_city|emp_sal|
+---+-----+--------+-------+
| 12| bob|New York| 80|
| 99|alice| Atlanta| 90|
+---+-----+--------+-------+
Let's use explain() to verify that this function is executed efficiently:
actual_df.explain(True)
Here are the logical plans that are output:
== Parsed Logical Plan ==
'Project ['id AS id#50, 'name AS name#51, '`emp.city` AS emp_city#52, '`emp.sal` AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#29 AS id#50, name#30 AS name#51, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Optimized Logical Plan ==
Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Physical Plan ==
*(1) Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
You can see that the parsed logical plan is almost identical to the physical plan, so the Catalyst optimizer doesn't have to do much optimization work. It converts id#29 AS id#50 to id#29, but that's not a lot of work.
The with_some_columns_renamed method generates a more efficient parsed plan.
def dots_to_underscores(s):
return s.replace('.', '_')
def change_col_name(s):
return '.' in s
actual_df = df.transform(quinn.with_some_columns_renamed(dots_to_underscores, change_col_name))
actual_df.explain(True)
This parsed plan only aliases the columns that contain dots.
== Parsed Logical Plan ==
'Project [unresolvedalias('id, None), unresolvedalias('name, None), '`emp.city` AS emp_city#42, '`emp.sal` AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Optimized Logical Plan ==
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Physical Plan ==
*(1) Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
Read this blog post for more information on why looping over a DataFrame and calling withColumnRenamed multiple times creates overly complicated parsed plans and should be avoided.
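For comparison, here is roughly the single-projection select that the plans above show quinn producing (a sketch; the helper name rename_all_columns is mine, not quinn's):

from pyspark.sql import functions as F

def rename_all_columns(df, fun):
    # One select produces a single Project node in the parsed plan, instead
    # of one nested Project per chained withColumnRenamed call.
    return df.select([F.col(f"`{c}`").alias(fun(c)) for c in df.columns])

actual_df = rename_all_columns(df, dots_to_underscores)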