Change multiple columns to different data types in PySpark

Time: 2018-07-25 14:39:09

Tags: python pandas apache-spark pyspark databricks

I have a DataFrame with more than 50 columns of various data types, and I would like to change all columns of a given type in one go. For example, the schema currently looks like this:

df3.printSchema()


 |-- CtpJobId: string (nullable = true)
 |-- TransformJobStateId: string (nullable = true)
 |-- LastError: string (nullable = true)
 |-- PriorityDate: string (nullable = true)
 |-- QueuedTime: string (nullable = true)
 |-- AccurateAsOf: string (nullable = true)
 |-- SentToDevice: string (nullable = true)
 |-- StartedAtDevice: string (nullable = true)
 |-- ProcessStart: string (nullable = true)
 |-- LastProgressAt: string (nullable = true)
 |-- ProcessEnd: string (nullable = true)
 |-- ClipFirstFrameNumber: string (nullable = true)
 |-- ClipLastFrameNumber: double (nullable = true)
 |-- SourceNamedLocation: string (nullable = true)
 |-- TargetId: string (nullable = true)
 |-- TargetNamedLocation: string (nullable = true)
 |-- TargetDirectory: string (nullable = true)
 |-- TargetFilename: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- AssignedDeviceId: string (nullable = true)
 |-- DeviceResourceId: string (nullable = true)
 |-- DeviceName: string (nullable = true)
 |-- srcDropFrame: string (nullable = true)
 |-- srcDuration: double (nullable = true)
 |-- srcFrameRate: double (nullable = true)
 |-- srcHeight: double (nullable = true)
 |-- srcMediaFormat: string (nullable = true)
 |-- srcWidth: double (nullable = true)

I know how to do this one column at a time, as I do now for the columns below.

timestamp_type = [
    'PriorityDate', 'QueuedTime', 'AccurateAsOf', 'SentToDevice', 
    'StartedAtDevice', 'ProcessStart', 'LastProgressAt', 'ProcessEnd'
]


integer_type = [
    'ClipFirstFrameNumber', 'ClipLastFrameNumber', 'TargetId', 'srcHeight',
    'srcMediaFormat', 'srcWidth'
]

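However, this looks ugly, and it is easy to miss a column I want to change. Is there a way to write a function that takes a list of columns to be converted to the same type, so that I can simply call something like convert_data_type and pass in the column names? Thanks in advance.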

1 Answer:

Answer 0: (score: 2)

Instead of enumerating every column individually, use a loop:

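from pyspark.sql.types import IntegerType, TimestampType

# Read each column from df3 itself so every cast builds on the previous result
for c in timestamp_type:
    df3 = df3.withColumn(c, df3[c].cast(TimestampType()))

for c in integer_type:
    df3 = df3.withColumn(c, df3[c].cast(IntegerType()))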
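The loop can also be wrapped in a small helper, which is roughly what the question asks for. This is only a sketch: the name convert_data_type comes from the question, while the signature and the missing-column check are my own assumptions. It relies on Column.cast accepting either a DataType instance or a type name string such as "timestamp" or "int".

```python
def convert_data_type(df, columns, data_type):
    """Return a DataFrame with every column in `columns` cast to `data_type`.

    `data_type` may be a pyspark DataType instance (e.g. TimestampType())
    or a type name string (e.g. "timestamp", "int").
    """
    # Assumption: fail early on misspelled names, since a typo is easy
    # to miss among 50+ columns.
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError("Columns not found in DataFrame: {}".format(missing))

    # Same loop as above, applied to whichever DataFrame is passed in.
    for c in columns:
        df = df.withColumn(c, df[c].cast(data_type))
    return df
```

With the lists from the question, usage would look like df3 = convert_data_type(df3, timestamp_type, "timestamp") followed by df3 = convert_data_type(df3, integer_type, "int").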

Or, equivalently, you can use functools.reduce:

from functools import reduce   # built in on Python 2, must be imported on Python 3
from pyspark.sql.types import IntegerType, TimestampType

df3 = reduce(
    lambda df, c: df.withColumn(c, df[c].cast(TimestampType())), 
    timestamp_type,
    df3
)

df3 = reduce(
    lambda df, c: df.withColumn(c, df[c].cast(IntegerType())),
    integer_type,
    df3
)
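If the number of columns is large, another option is to do all the casts in a single select rather than one withColumn call per column. The PySpark documentation for withColumn notes that calling it repeatedly in a loop can produce a large query plan and suggests select for that case. The sketch below assumes a hypothetical helper name, cast_columns, and a dict mapping column names to target types; neither is part of the original answer.

```python
def cast_columns(df, type_by_column):
    """Cast the columns named in `type_by_column` in one select.

    `type_by_column` maps a column name to a target type, given either as
    a pyspark DataType instance or a type name string such as "int".
    Columns not in the mapping are passed through unchanged, and the
    original column order is preserved.
    """
    missing = [c for c in type_by_column if c not in df.columns]
    if missing:
        raise ValueError("Columns not found in DataFrame: {}".format(missing))

    return df.select([
        # alias(c) keeps the original column name after the cast
        df[c].cast(type_by_column[c]).alias(c) if c in type_by_column else df[c]
        for c in df.columns
    ])


# Usage with the lists from the question (commented out because it needs
# a live DataFrame named df3):
# type_by_column = {c: "timestamp" for c in timestamp_type}
# type_by_column.update({c: "int" for c in integer_type})
# df3 = cast_columns(df3, type_by_column)
```

Keeping the mapping in one dict also makes it easier to see at a glance which columns are converted to which type.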