Alternative to Union in pyspark

Asked: 2017-10-11 18:04:52

Tags: apache-spark pyspark

I wrote some code that does the following:

1. Take n rows from a data frame (df1) for each stratum
2. Sort the rows by stratum
3. Replace the data in one of the columns with data from another data frame (df2)
4. Union the data frames (df1 and df2)

I know unionAll is an expensive operation in Spark. Is there an alternative, more efficient and faster way to do the same thing? Thanks.

from pyspark.sql import Window
from pyspark.sql.functions import row_number

SeedWindow = Window.orderBy("SeedEmail")
AlphaOutputWindow = Window.partitionBy("col1").orderBy("col2")

# Number the distinct seed emails
seedEmails = (seeds.filter(pos_filter_cond).select("col1....col2")
        .distinct().withColumn("row_id", row_number().over(SeedWindow)))

seedCounts = seedEmails.count()

# Keep the first seedCounts rows of each partition
sampleForSeed = (final_result.withColumn("row_id", row_number().over(AlphaOutputWindow))
        .filter("row_id <= " + str(seedCounts))
    )

sampleAfterSeed = sampleForSeed.join(seedEmails, ["cols"], "inner")

finalOutputColumns = final_result_moduleCount.columns

final_result_moduleCount = (final_result_moduleCount.select(finalOutputColumns)
        .unionAll(sampleAfterSeed.select(finalOutputColumns)))

0 Answers:

There are no answers yet.