Merge rows in a Spark Scala DataFrame, keeping the history

Asked: 2019-07-19 18:21:19

Tags: scala apache-spark

How can rows in a Spark DataFrame be merged over their history? Performance also matters.

Basically, rows with the same ID should be merged, and each output row should carry, per column, the most recent non-null value seen up to that point; if all values so far are null, null should be kept.

Also, I would prefer to avoid Spark SQL window functions, because this needs to be very fast.

Merge rows in a spark scala Dataframe

That approach works, but it only produces the latest record; what I want is to be able to generate the full history.

I have the following data

ID  Name    Passport    Country  License    UpdatedtimeStamp
1   Ostrich 12345       -       ABC         11-02-2018
1   -       -           -       BCD         10-02-2018
1   Shah    12345       -       -           12-02-2018
2   PJ      -           ANB     a           10-02-2018

The desired output is

ID  Name    Passport    Country  License    UpdatedtimeStamp
1   -       -           -       BCD         10-02-2018
1   Ostrich 12345       -       ABC         11-02-2018
1   Shah    12345       -       ABC         12-02-2018
2   PJ      -           ANB     a           10-02-2018

The code below works perfectly for the latest record, but I would like to know how to do this for the history

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// case class the udf casts each struct to (must be defined before the udf that uses it)
case class out(Name: String, Passport: String, Country: String, License: String, UpdatedtimeStamp: Long)

// udf function definition
def sortAndAggUdf = udf((structs: Seq[Row]) => {
  // sort the collected list by timestamp in descending order
  val sortedStruct = structs.sortBy(str => str.getAs[Long]("UpdatedtimeStamp"))(Ordering[Long].reverse)
  // select the first (newest) struct and cast it to the out case class
  val first = out(
    sortedStruct(0).getAs[String]("Name"),
    sortedStruct(0).getAs[String]("Passport"),
    sortedStruct(0).getAs[String]("Country"),
    sortedStruct(0).getAs[String]("License"),
    sortedStruct(0).getAs[Long]("UpdatedtimeStamp"))
  // walk the records, filling each null/empty field with the first non-null value found
  sortedStruct.foldLeft(first)((x, y) => {
    out(
      if (x.Name == null || x.Name.isEmpty) y.getAs[String]("Name") else x.Name,
      if (x.Passport == null || x.Passport.isEmpty) y.getAs[String]("Passport") else x.Passport,
      if (x.Country == null || x.Country.isEmpty) y.getAs[String]("Country") else x.Country,
      if (x.License == null || x.License.isEmpty) y.getAs[String]("License") else x.License,
      x.UpdatedtimeStamp)
  })
})

// pack the non-key columns into one struct and convert UpdatedtimeStamp to long so the udf can sort on it
df.select(col("ID"), struct(
    col("Name"), col("Passport"), col("Country"), col("License"),
    unix_timestamp(col("UpdatedtimeStamp"), "MM-dd-yyyy").as("UpdatedtimeStamp")).as("struct"))
  // group, collect the structs, and pass them to the udf for merging
  .groupBy("ID").agg(sortAndAggUdf(collect_list("struct")).as("struct"))
  // split the aggregated struct back into separate columns
  .select(col("ID"), col("struct.*"))
  // restore the date format
  .withColumn("UpdatedtimeStamp", date_format(col("UpdatedtimeStamp").cast("timestamp"), "MM-dd-yyyy"))
  .show(false)
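One possible way to extend this to the history is to replace the fold with a scan: sort ascending by timestamp and carry every non-null field forward, emitting one merged record per timestamp instead of only the last one. A minimal plain-Scala sketch of that carry-forward logic, where `Rec` and `mergeHistory` are hypothetical names and `Option` stands in for the nullable struct fields:

```scala
// Hypothetical stand-in for one collected struct; Option models nullable columns.
case class Rec(name: Option[String], passport: Option[String],
               country: Option[String], license: Option[String], ts: Long)

// Sort ascending by timestamp, then carry each field forward:
// keep the current value if present, otherwise inherit the previous
// merged value (a field stays None until something sets it).
def mergeHistory(rows: Seq[Rec]): Seq[Rec] = {
  if (rows.isEmpty) Nil
  else {
    val sorted = rows.sortBy(_.ts)
    sorted.tail.scanLeft(sorted.head) { (prev, cur) =>
      Rec(
        cur.name.orElse(prev.name),
        cur.passport.orElse(prev.passport),
        cur.country.orElse(prev.country),
        cur.license.orElse(prev.license),
        cur.ts)
    }
  }
}
```

Wired into the udf above, the same change would mean sorting ascending, returning the whole scanned `Seq[out]` instead of the folded head, and applying `explode` to the aggregated column to get one output row per timestamp.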

========================================================

The above code returns

ID  Name    Passport    Country  License    UpdatedtimeStamp
1   Shah    12345       -       ABC         12-02-2018
2   PJ      -           ANB     a           10-02-2018

===========================================================

I need it displayed like this

ID  Name    Passport    Country  License    UpdatedtimeStamp
1   -       -           -       BCD         10-02-2018
1   Ostrich 12345       -       ABC         11-02-2018
1   Shah    12345       -       ABC         12-02-2018
2   PJ      -           ANB     a           10-02-2018

0 Answers:

There are no answers