How do I compute the delta between two DataFrames?

Time: 2020-01-27 11:36:20

Tags: scala dataframe apache-spark

I want to compute the delta between two tables (the current full load and yesterday's full load).

val df_current_full = spark.sql("select * from current_full")
val df_previous_full = spark.sql("select * from previous_full")

I did a full outer join between df_current_full and df_previous_full on the key.

val df_currentFullTableExceptPreviousFullCurrentView: DataFrame = df_currentFullTable
  .join(df_previousFullCurrentView, df_currentFullTable(key) ===
    df_previousFullCurrentView(key), "full_outer")

To find out whether a row was deleted or created, I can simply do this:

val df_currentFullTableExceptPreviousFullCurrentView: DataFrame = df_currentFullTable
  .join(df_previousFullCurrentView, df_currentFullTable(key) === df_previousFullCurrentView(key), "full_outer")
  .withColumn("flagCreatedDeleted", UDF_udfCreateFlagCreatedDeleted(df_currentFullTable(key),
    df_previousFullCurrentView(key)))

val UDF_udfCreateFlagCreatedDeleted = udf(udfCreateFlagCreatedDeleted _)

def udfCreateFlagCreatedDeleted(df_currentFullTable_key: String, df_currentPreviousTable_key: String): String = {
  // "S": the key exists only in the previous table (row deleted).
  // "C": the key exists only in the current table (row created).
  if (df_currentFullTable_key == null && df_currentPreviousTable_key != null) "S"
  else if (df_currentFullTable_key != null && df_currentPreviousTable_key == null) "C"
  else null
}

But I am stuck on the modified rows: how can I retrieve them? My tables contain string, integer, and date columns.

Thanks for your help.

If I do it the following way, the code becomes very long: I have 50 columns, and they are not all of the same type.

val df_currentFullTableExceptPreviousFullCurrentView: DataFrame = df_currentFullTable
  .join(df_previousFullCurrentView, df_currentFullTable(key) === df_previousFullCurrentView(key), "full_outer")
  .withColumn("flagCreatedDeleted", UDF_udfCreateFlagCreatedDeleted(df_currentFullTable(key),
    df_previousFullCurrentView(key)))
  .withColumn("flagModifiedStringNameId", UDF_udfCreateFlagModifiedString(df_currentFullTable(key),
    df_previousFullCurrentView(key), df_currentFullTable("name_id"), df_previousFullCurrentView("name_id")))
  .withColumn("flagModifiedStringSurname", UDF_udfCreateFlagModifiedString(df_currentFullTable(key),
    df_previousFullCurrentView(key), df_currentFullTable("Surname"), df_previousFullCurrentView("Surname")))
  .withColumn("flagModifiedStringAge", UDF_udfCreateFlagModifiedString(df_currentFullTable(key),
    df_previousFullCurrentView(key), df_currentFullTable("Age"), df_previousFullCurrentView("Age")))
  .withColumn("flagModifiedStringWorkingE", UDF_udfCreateFlagModifiedString(df_currentFullTable(key),
    df_previousFullCurrentView(key), df_currentFullTable("WorkingE"), df_previousFullCurrentView("WorkingE")))

val UDF_udfCreateFlagModifiedString = udf(udfCreateFlagModifiedString _)

def udfCreateFlagModifiedString(df_currentFullTable_key: String, df_currentPreviousTable_key: String,
                                CurrentStringModified: String, PreviousStringModified: String): String = {
  // "U": same key on both sides, but the compared value differs (row updated).
  if (df_currentFullTable_key == df_currentPreviousTable_key &&
    CurrentStringModified != PreviousStringModified) "U"
  else null
}

2 answers:

Answer 0 (score: 0)

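You don't even need a UDF: if previous.id is null, the row was created; if current.id is null, the row was deleted. If neither is null, the row is present in both DataFrames, so you can compare the two rows for equality. If they differ, the row was updated.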

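// Schema of the sample data (column names match the output shown below).
case class Data(id: Int, x: String, y: String)

val prev = Seq(Data(1, "foo", "bar"), Data(2, "foo2", "bar2"), Data(3, "foo3", "bar3")).toDF
val curr = Seq(Data(1, "foo", "barNew"), Data(3, "foo3", "bar3"), Data(4, "foo4", "bar4")).toDF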

prev.createOrReplaceTempView("previous_full")
curr.createOrReplaceTempView("current_full")

spark.sql("""
  select *,
       (case when previous_full.id is null then 'C'
             when current_full.id is null then 'S'
             when struct(previous_full.*) <> struct(current_full.*) then 'U'
             else null end) as flag
  from previous_full
  full outer join current_full on previous_full.id = current_full.id""").show

/*
+----+----+----+----+----+------+----+
|  id|   x|   y|  id|   x|     y|flag|
+----+----+----+----+----+------+----+
|   1| foo| bar|   1| foo|barNew|   U|
|   3|foo3|bar3|   3|foo3|  bar3|null|
|null|null|null|   4|foo4|  bar4|   C|
|   2|foo2|bar2|null|null|  null|   S|
+----+----+----+----+----+------+----+
*/
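
The same logic can probably be written with the DataFrame API, which avoids listing the 50 columns by hand. The following is only a minimal sketch, not part of the original answer: the object `DeltaFlagSketch`, the helper `withDeltaFlag`, and the aliases `prev` and `curr` are hypothetical names, and it assumes both DataFrames share the same schema. Instead of comparing structs it compares the non-key columns one by one with the null-safe operator `<=>`, so two nulls count as equal.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object DeltaFlagSketch {
  // Hypothetical helper: full outer join on `key`, then flag each row as
  // created ("C"), deleted ("S"), updated ("U") or unchanged (null).
  // Assumes `previous` and `current` have identical column names.
  def withDeltaFlag(previous: DataFrame, current: DataFrame, key: String): DataFrame = {
    val prev = previous.as("prev")
    val curr = current.as("curr")

    // True when at least one non-key column differs between the two sides.
    val anyDifference = current.columns
      .filter(_ != key)
      .map(c => !(col(s"prev.$c") <=> col(s"curr.$c")))
      .foldLeft(lit(false))(_ || _)

    prev
      .join(curr, col(s"prev.$key") === col(s"curr.$key"), "full_outer")
      .withColumn("flag",
        when(col(s"prev.$key").isNull, lit("C"))
          .when(col(s"curr.$key").isNull, lit("S"))
          .when(anyDifference, lit("U"))) // no otherwise: unchanged rows get null
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("delta-flag-sketch").getOrCreate()
    import spark.implicits._

    // Same sample rows as in the answer above.
    val prev = Seq((1, "foo", "bar"), (2, "foo2", "bar2"), (3, "foo3", "bar3")).toDF("id", "x", "y")
    val curr = Seq((1, "foo", "barNew"), (3, "foo3", "bar3"), (4, "foo4", "bar4")).toDF("id", "x", "y")

    withDeltaFlag(prev, curr, "id").show()
    spark.stop()
  }
}
```

On the sample data this should flag id 1 as U, id 2 as S, id 4 as C, and leave id 3 unflagged. Because the comparison is built from `current.columns`, it works for string, integer, and date columns alike without a UDF per type.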

Answer 1 (score: -1)

You can use the same approach:

val isUpdatedColumnUDF = udf(isUpdatedColumn _)

def isUpdatedColumn(currentColumn: String, previousColumn: String): String =
  // Returns "updated" when the two values differ, null otherwise.
  if (previousColumn != currentColumn) "updated"
  else null
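
If the goal is to know which columns changed, a per-column check like this one can be generated for every column instead of being written out 50 times. The sketch below is an illustration under assumptions, not the answerer's code: `ChangedColumnsSketch`, `withChangedColumns`, and the output column `changedColumns` are hypothetical names, both DataFrames are assumed to share the same schema, and built-in column expressions replace the UDF so that non-string columns need no special handling.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

object ChangedColumnsSketch {
  // Hypothetical helper: for rows present on both sides, adds a
  // comma-separated list of the non-key columns whose values differ.
  def withChangedColumns(previous: DataFrame, current: DataFrame, key: String): DataFrame = {
    val prev = previous.as("prev")
    val curr = current.as("curr")

    // One expression per non-key column: the column name if it changed, else null.
    val changed: Seq[Column] = current.columns.toSeq
      .filter(_ != key)
      .map(c => when(!(col(s"prev.$c") <=> col(s"curr.$c")), lit(c)))

    prev
      .join(curr, col(s"prev.$key") === col(s"curr.$key"), "inner")
      // concat_ws skips nulls, so only the changed column names remain.
      .withColumn("changedColumns", concat_ws(",", changed: _*))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("changed-columns-sketch").getOrCreate()
    import spark.implicits._

    val prev = Seq((1, "foo", "bar"), (3, "foo3", "bar3")).toDF("id", "x", "y")
    val curr = Seq((1, "foo", "barNew"), (3, "foo3", "bar3")).toDF("id", "x", "y")

    withChangedColumns(prev, curr, "id").show()
    spark.stop()
  }
}
```

An inner join is used here because a column-level comparison only makes sense for rows that exist in both tables; created and deleted rows would still need the key-based check from the question.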