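I wrote some code that does the following:
1. Take n rows per layer from a DataFrame (df1)
2. Sort the rows by layer
3. Replace the data in one column with data from another DataFrame (df2)
4. Union the DataFrames (df1 and df2)
I understand that unionAll is an expensive operation in Spark. Is there an alternative, more efficient and faster way to do the same thing? Thanks.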
from pyspark.sql import Window
from pyspark.sql.functions import row_number

# Number the distinct seed rows 1..N in SeedEmail order
SeedWindow = Window.orderBy("SeedEmail")
# Number the rows within each col1 group in col2 order
AlphaOutputWindow = Window.partitionBy("col1").orderBy("col2")

seedEmails = (seeds.filter(pos_filter_cond).select("col1....col2")
              .distinct().withColumn("row_id", row_number().over(SeedWindow)))
seedCounts = seedEmails.count()

# Keep the first seedCounts rows of each group
sampleForSeed = (final_result.withColumn("row_id", row_number().over(AlphaOutputWindow))
                 .filter("row_id <= " + str(seedCounts)))
sampleAfterSeed = sampleForSeed.join(seedEmails, ["cols"], "inner")

finalOutputColumns = final_result_moduleCount.columns
final_result_moduleCount = (final_result_moduleCount.select(finalOutputColumns)
                            .unionAll(sampleAfterSeed.select(finalOutputColumns)))
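
To make the intended result concrete, here is a minimal plain-Python sketch of the four steps on small in-memory lists. It is not Spark code and says nothing about performance; it only pins down the output I expect. The function name `sample_and_replace`, the column names `layer`, `score` and `email`, and the sample values are all hypothetical. It also assumes that the join pairs rows on `row_id`, and it samples from and appends to the same table for simplicity.

```python
from collections import defaultdict


def sample_and_replace(rows, seeds, layer_key="layer", order_key="score",
                       target_key="email"):
    """Sample rows per layer, overwrite one column with seed values, append."""
    # Distinct seeds in sorted order, mirroring distinct() + row_number().
    seeds = sorted(set(seeds))
    n = len(seeds)

    # Step 1: group the rows by layer.
    by_layer = defaultdict(list)
    for row in rows:
        by_layer[row[layer_key]].append(row)

    sampled = []
    for layer in sorted(by_layer):
        # Step 2: sort within the layer, then keep at most n rows.
        ordered = sorted(by_layer[layer], key=lambda r: r[order_key])
        # Step 3: pair the i-th sampled row with the i-th seed (assumed
        # row_id join) and replace the target column.
        for row, seed in zip(ordered[:n], seeds):
            sampled.append({**row, target_key: seed})

    # Step 4: append without de-duplicating, like unionAll.
    return rows + sampled


rows = [
    {"layer": "a", "score": 2, "email": "x@a"},
    {"layer": "a", "score": 1, "email": "y@a"},
    {"layer": "a", "score": 3, "email": "z@a"},
    {"layer": "b", "score": 5, "email": "w@b"},
]
seeds = ["s2@seed", "s1@seed", "s1@seed"]
out = sample_and_replace(rows, seeds)
```

With two distinct seeds, layer `a` contributes its two lowest-scoring rows and layer `b` contributes its single row, so `out` holds the four original rows followed by three rewritten copies.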