I want to compute the delta between two tables (the current full load and yesterday's full load).
val df_current_full = spark.sql("select * from current_full")
val df_previous_full = spark.sql("select * from previous_full")
I do a full outer join between df_current_full and df_previous_full on the key.
val df_currentFullTableExceptPreviousFullCurrentView: DataFrame = df_currentFullTable
  .join(df_previousFullCurrentView, df_currentFullTable(key) === df_previousFullCurrentView(key), "full_outer")
To know whether a row was deleted or created, I can simply do:
val df_currentFullTableExceptPreviousFullCurrentView: DataFrame = df_currentFullTable
  .join(df_previousFullCurrentView, df_currentFullTable(key) === df_previousFullCurrentView(key), "full_outer")
  .withColumn("flagCreatedDeleted",
    UDF_udfCreateFlagCreatedDeleted(df_currentFullTable(key), df_previousFullCurrentView(key)))

val UDF_udfCreateFlagCreatedDeleted = udf(udfCreateFlagCreatedDeleted _)

// "S" = only in the previous full (deleted), "C" = only in the current full (created)
def udfCreateFlagCreatedDeleted(df_currentFullTable_key: String, df_currentPreviousTable_key: String): String = {
  if (df_currentFullTable_key == null && df_currentPreviousTable_key != null) "S"
  else if (df_currentFullTable_key != null && df_currentPreviousTable_key == null) "C"
  else null
}
But what about modified rows? How can I retrieve them? My table has string, integer, and date columns.
Thanks for your help.
If I do it this way, the code gets very long; I have 50 columns with different types:
val df_currentFullTableExceptPreviousFullCurrentView: DataFrame = df_currentFullTable
  .join(df_previousFullCurrentView, df_currentFullTable(key) === df_previousFullCurrentView(key), "full_outer")
  .withColumn("flagCreatedDeleted",
    UDF_udfCreateFlagCreatedDeleted(df_currentFullTable(key), df_previousFullCurrentView(key)))
  .withColumn("flagModifiedStringNameId",
    UDF_udfCreateFlagModifiedString(df_currentFullTable(key), df_previousFullCurrentView(key),
      df_currentFullTable("name_id"), df_previousFullCurrentView("name_id")))
  .withColumn("flagModifiedStringSurname",
    UDF_udfCreateFlagModifiedString(df_currentFullTable(key), df_previousFullCurrentView(key),
      df_currentFullTable("Surname"), df_previousFullCurrentView("Surname")))
  .withColumn("flagModifiedStringAge",
    UDF_udfCreateFlagModifiedString(df_currentFullTable(key), df_previousFullCurrentView(key),
      df_currentFullTable("Age"), df_previousFullCurrentView("Age")))
  .withColumn("flagModifiedStringWorkingE",
    UDF_udfCreateFlagModifiedString(df_currentFullTable(key), df_previousFullCurrentView(key),
      df_currentFullTable("WorkingE"), df_previousFullCurrentView("Working")))
val UDF_udfCreateFlagModifiedString = udf(udfCreateFlagModifiedString _)

def udfCreateFlagModifiedString(df_currentFullTable_key: String, df_currentPreviousTable_key: String,
                                CurrentStringModified: String, PreviousStringModified: String): String = {
  if (df_currentFullTable_key == df_currentPreviousTable_key &&
      CurrentStringModified != PreviousStringModified) "U"
  else null
}
Answer 0 (score: 0)
You don't even need a UDF: if previous.id is null, the row was created; if current.id is null, it was deleted. If neither is null, the row appears in both DataFrames, so you can compare the two rows for equality; if they differ, it is an update.
// assumes: case class Data(id: Int, x: String, y: String) and import spark.implicits._ (for toDF)
val prev = Seq(Data(1, "foo", "bar"), Data(2, "foo2", "bar2"), Data(3, "foo3", "bar3")).toDF
val curr = Seq(Data(1, "foo", "barNew"), Data(3, "foo3", "bar3"), Data(4, "foo4", "bar4")).toDF
prev.createOrReplaceTempView("previous_full")
curr.createOrReplaceTempView("current_full")
spark.sql("""
select *,
(case when previous_full.id is null then 'C'
when current_full.id is null then 'S'
when struct(previous_full.*) <> struct(current_full.*) then 'U'
else null end) as flag
from previous_full
full outer join current_full on previous_full.id = current_full.id""").show
/*
+----+----+----+----+----+------+----+
| id| x| y| id| x| y|flag|
+----+----+----+----+----+------+----+
| 1| foo| bar| 1| foo|barNew| U|
| 3|foo3|bar3| 3|foo3| bar3|null|
|null|null|null| 4|foo4| bar4| C|
| 2|foo2|bar2|null|null| null| S|
+----+----+----+----+----+------+----+
*/
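For reference, roughly the same check can be expressed with the DataFrame API instead of SQL. This is only a sketch, reusing prev and curr from the example above and assuming id is the join key:

import org.apache.spark.sql.functions.{struct, when}

// Pack each side's columns into a struct so whole rows can be compared at once.
val prevRow = struct(prev.columns.map(prev(_)): _*)
val currRow = struct(curr.columns.map(curr(_)): _*)

val flagged = prev
  .join(curr, prev("id") === curr("id"), "full_outer")
  .withColumn("flag",
    when(prev("id").isNull, "C")        // only in current  -> created
      .when(curr("id").isNull, "S")     // only in previous -> deleted
      .when(prevRow =!= currRow, "U"))  // in both but different -> updated

flagged.show(false)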
Answer 1 (score: -1)
You can use the same approach:
val isUpdatedColumnUDF = udf(isUpdatedColumn _)

def isUpdatedColumn(currentColumn: String, previousColumn: String): String =
  if (previousColumn != currentColumn) "updated" else null
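If the goal is to avoid writing 50 withColumn calls by hand, the calls can also be generated by folding over a list of column names. The following is only a sketch, assuming the compared columns have the same name on both sides; columnsToCompare is a hypothetical, hand-maintained list, and key, df_currentFullTable, df_previousFullCurrentView are the names used in the question:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.when

// Hypothetical list of the columns whose changes should be flagged.
val columnsToCompare = Seq("name_id", "Surname", "Age", "WorkingE")

// Add one "flagModified_<col>" column per entry, without a UDF:
// the flag is "U" when the keys match but the column values differ (<=> is null-safe equality).
val flagged: DataFrame = columnsToCompare.foldLeft(df_currentFullTableExceptPreviousFullCurrentView) {
  (df, colName) =>
    df.withColumn(
      s"flagModified_$colName",
      when(df_currentFullTable(key) === df_previousFullCurrentView(key) &&
           !(df_currentFullTable(colName) <=> df_previousFullCurrentView(colName)), "U"))
}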