If I chain the output of one withColumn directly into another withColumn, the job takes hours to complete:
df_spark.filter(some_criteria) \
    .withColumn('Field2', MyCustomUDF('Field1')) \
    .withColumn('Field3', MyCustomUDF2('Field2')) \
    .write.parquet('Parq.parquet')
However, if I split it into two separate jobs with an intermediate Parquet file, it finishes in only a few minutes.
# Step 1
df_spark.filter(some_criteria) \
    .withColumn('Field2', MyCustomUDF('Field1')) \
    .write.parquet('TmpFile.parquet')

# Step 2
df_spark2 = spark.read.parquet('TmpFile.parquet')
df_spark2.withColumn('Field3', MyCustomUDF2('Field2')) \
    .write.parquet('Parq.parquet')