I have a dataframe in pyspark with 15 columns.
The column names are id, name, emp.dno, emp.sal, state, emp.city, zip ...
Now I want to replace the '.' in the column names with '_', like 'emp.dno' to 'emp_dno'.
I want to do this dynamically. How can I achieve this in pyspark?
Answer 0 (score: 17)

You can use something similar to this great solution from @zero323:
df.toDF(*(c.replace('.', '_') for c in df.columns))

alternatively:
from pyspark.sql.functions import col

replacements = {c: c.replace('.', '_') for c in df.columns if '.' in c}

# Backtick-quote each name so a dotted column name is treated as a literal
# column name rather than as a struct field access:
df.select([col(f"`{c}`").alias(replacements.get(c, c)) for c in df.columns])
where the replacements dictionary looks like:

{'emp.city': 'emp_city', 'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'}

Update: if the dataframe also has spaces in the column names, how can both the spaces and the '.' be replaced?
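One way to handle that follow-up is to normalize every run of dots and whitespace in a single pass. A minimal sketch, assuming an existing SparkSession named spark (the regex and the sample data are illustrative, not from the original answer):

import re

# Hypothetical df with one dotted and one spaced column name:
df = spark.createDataFrame([(1, "bob", "New York")], ["id", "emp name", "emp.city"])

# Collapse every run of '.' and/or whitespace in each name into a single '_'.
df_clean = df.toDF(*(re.sub(r"[.\s]+", "_", c) for c in df.columns))
print(df_clean.columns)  # ['id', 'emp_name', 'emp_city']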
Answer 1 (score: 1)

Here is a quick and handy function for you to use. Enjoy! :)
def rename_cols(rename_df):
    # Rename each column in turn, replacing '.' with '_' in its name.
    # withColumnRenamed matches the old name literally, so dotted names
    # need no backtick-quoting here.
    for column in rename_df.columns:
        new_column = column.replace('.', '_')
        rename_df = rename_df.withColumnRenamed(column, new_column)
    return rename_df
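Usage is then a single call (df being the questioner's dataframe):

renamed_df = rename_cols(df)
print(renamed_df.columns)  # e.g. ['id', 'name', 'emp_dno', 'emp_sal', ...]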
Answer 2 (score: 0)

The simplest way is as follows:
from pyspark.sql import functions as F

(df
 # Backticks keep dotted names from being parsed as struct field access:
 .select(*[F.col(f"`{c}`").alias(c.replace('.', '_')) for c in df.columns])
 .toPandas().head()
)
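Note that toPandas().head() is only there to display the result; to keep working with a Spark DataFrame, assign the select itself (a trivial variation on the answer above):

renamed_df = df.select(*[F.col(f"`{c}`").alias(c.replace('.', '_')) for c in df.columns])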
Hope this helps.
Answer 3 (score: 0)

MaxU's answer is good and works. This post outlines another efficient approach that helps keep your codebase clean (it uses the quinn library).

Suppose you have the following DataFrame:
+---+-----+--------+-------+
| id| name|emp.city|emp.sal|
+---+-----+--------+-------+
| 12| bob|New York| 80|
| 99|alice| Atlanta| 90|
+---+-----+--------+-------+
Here's how to replace the dots with underscores in all of the column names.
import quinn
def dots_to_underscores(s):
return s.replace('.', '_')
actual_df = df.transform(quinn.with_columns_renamed(dots_to_underscores))
actual_df.show()
Here's the resulting actual_df:
+---+-----+--------+-------+
| id| name|emp_city|emp_sal|
+---+-----+--------+-------+
| 12| bob|New York| 80|
| 99|alice| Atlanta| 90|
+---+-----+--------+-------+
Let's use explain() to verify that this function is executed efficiently:
actual_df.explain(True)
Here are the logical plans that are output:
== Parsed Logical Plan ==
'Project ['id AS id#50, 'name AS name#51, '`emp.city` AS emp_city#52, '`emp.sal` AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#29 AS id#50, name#30 AS name#51, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Optimized Logical Plan ==
Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Physical Plan ==
*(1) Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
You can see that the parsed logical plan is almost identical to the physical plan, so the Catalyst optimizer doesn't have to do much optimization work. It converts id#29 AS id#50 to id#29, but that's not a lot of work.
The with_some_columns_renamed method generates a more efficient parsed plan.
def dots_to_underscores(s):
return s.replace('.', '_')
def change_col_name(s):
return '.' in s
actual_df = df.transform(quinn.with_some_columns_renamed(dots_to_underscores, change_col_name))
actual_df.explain(True)
This parsed plan only aliases the columns that contain dots.
== Parsed Logical Plan ==
'Project [unresolvedalias('id, None), unresolvedalias('name, None), '`emp.city` AS emp_city#42, '`emp.sal` AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Optimized Logical Plan ==
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Physical Plan ==
*(1) Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
Read this blog post for more information on why looping over a DataFrame and calling withColumnRenamed multiple times creates overly complicated parsed plans and should be avoided.
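For comparison, here is roughly the single-projection select that the plans above show quinn producing (a sketch; the helper name rename_all_columns is mine, not quinn's):

from pyspark.sql import functions as F

def rename_all_columns(df, fun):
    # One select produces a single Project node in the parsed plan, instead
    # of one nested Project per chained withColumnRenamed call.
    return df.select([F.col(f"`{c}`").alias(fun(c)) for c in df.columns])

actual_df = rename_all_columns(df, dots_to_underscores)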