Question

I have a PySpark DataFrame with 100 columns:

df1=[(col1,string),(col2,double),(col3,bigint),..so on]

I have another PySpark DataFrame, df2, with the same column count and column names but different data types.

df2=[(col1,bigint),(col2,double),(col3,string),..so on]

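How can I make the data type of every column in df2 the same as the data type of the corresponding column in df1?

It should happen iteratively, and a column whose data type already matches should not be changed.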

Answer 1

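If, as you say, the column names match and the column counts match, you can simply loop over the schema of df1 and cast each column of df2 to the corresponding dataType in df1:

from pyspark.sql import functions as F

df2 = df2.select([F.col(c.name).cast(c.dataType) for c in df1.schema])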

Answer 2

You can use the cast function:

from pyspark.sql import functions as f

# get schema for each DF
df1_schema=df1.dtypes
df2_schema=df2.dtypes

# iterate through cols (paired by position) and cast those which differ in type
for (c1, d1), (c2, d2) in zip(df1_schema, df2_schema):
    # check if datatypes are the same, otherwise cast to df1's type
    if d1 != d2:
        df2 = df2.withColumn(c2, f.col(c2).cast(d1))
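
The loop above pairs columns by position, so it relies on both DataFrames listing their columns in the same order. As a rough sketch of an alternative, the version below matches columns by name and builds a single select instead of chaining withColumn calls. The helper names `cast_plan` and `align_types` are hypothetical, not part of PySpark, and the sketch assumes column names contain no dots or other characters that need escaping.

```python
def cast_plan(target_dtypes, source_dtypes):
    """Map each source column whose type differs to its target type.

    Both arguments are lists of (name, type_string) pairs, the shape
    returned by DataFrame.dtypes. Columns are matched by name.
    """
    target = dict(target_dtypes)
    return {
        name: target[name]
        for name, dtype in source_dtypes
        if name in target and target[name] != dtype
    }


def align_types(df2, df1):
    """Return df2 with differing columns cast to df1's types (hypothetical helper)."""
    # Imported here so cast_plan stays usable without a Spark installation.
    from pyspark.sql import functions as F

    plan = cast_plan(df1.dtypes, df2.dtypes)
    # Columns already of the right type are selected unchanged.
    return df2.select([
        F.col(c).cast(plan[c]).alias(c) if c in plan else F.col(c)
        for c in df2.columns
    ])


# Plain-Python check of the plan, using the types from the question
df1_dtypes = [("col1", "string"), ("col2", "double"), ("col3", "bigint")]
df2_dtypes = [("col1", "bigint"), ("col2", "double"), ("col3", "string")]
print(cast_plan(df1_dtypes, df2_dtypes))
# prints {'col1': 'string', 'col3': 'bigint'}
```

With real DataFrames this would be called as `df2 = align_types(df2, df1)`. Only col1 and col3 are cast in the example from the question; col2 is passed through as is.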

Cast the data types of columns based on another DataFrame
