I am looking to join multiple tables to get a single denormalized result table. Below is one such scenario: I have two tables and an expected result table.
Table 1:
(table not shown)
Table 2:
(table not shown)
Note that the id values are the same in both tables. Think of these as event tables, where different events occur during different time periods. The final result table should therefore contain all the events, with no overlap between the From and To dates. If there is an overlap between 'From Date' and 'To Date', as in this example (the 'To Date' of record 1 in Table 1 is greater than the 'From Date' of record 1 in Table 2), then the result table's 'To Date' is updated to the nearest next date minus 1 second (in this case, 06-Jan-2017 12:00:00 AM minus 1 second).
Result:
+-------+-----------------------+-----------------------+----+
|     id|              From Date|                To Date|User|
+-------+-----------------------+-----------------------+----+
|AA12345|02-Jan-2017 12:00:00 AM|08-Jan-2017 11:59:59 PM|LL7R|
|AA12345|09-Jan-2017 12:00:00 AM|14-Feb-2017 11:59:59 PM|AT3B|
|AA12345|15-Feb-2017 12:00:00 AM|31-Dec-3030 11:59:59 PM|UJ5G|
+-------+-----------------------+-----------------------+----+
How can we achieve this efficiently?
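For reference, the overlap-trimming rule itself could be sketched in Spark with a window function. This is only a sketch under assumptions: the two event tables have already been unioned into a single frame (called events here, a hypothetical name) and the date columns have been parsed into actual timestamps.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// For each id, look ahead to the next event's start and cap the current end just before it.
val w = Window.partitionBy( "id" ).orderBy( col( "From Date" ) )

val trimmed =
  events
    .withColumn( "next_from", lead( col( "From Date" ), 1 ).over( w ) )
    .withColumn(
      "To Date",
      when(
        col( "next_from" ).isNotNull && col( "To Date" ) >= col( "next_from" ),
        col( "next_from" ) - expr( "INTERVAL 1 SECOND" )  // next start minus 1 second
      ).otherwise( col( "To Date" ) )
    )
    .drop( "next_from" )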
Answer 0 (score: 4)
So what you want is an outer join. There are four variants of this operation, depending on which table takes priority when the values in the columns do not match.
In this example, we have two tables.
Table 1
+------+--------------------+--------------------+----+
| id| From Date| To Date|User|
+------+--------------------+--------------------+----+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G|
+------+--------------------+--------------------+----+
Table 2
+------+--------------------+--------------------+--------------------+
| id| From Date| To Date| Associated id|
+------+--------------------+--------------------+--------------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|           [AA12345]|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...| [AA12345]|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...| [AA12345]|
|AA1114|24-Jan-2017 12:00...|31-Dec-3030 11:59...|[AA12345, AA23456...|
+------+--------------------+--------------------+--------------------+
Note that the first row in Table 2 not only has the same id as the first row in Table 1, but also the same From Date and To Date values. The second row, on the other hand, has the same id and To Date but a different From Date. The third row has only the same id, and the fourth row is completely different. For simplicity, let us assume this combination covers all the variations in the data.
Now for the different types of joins.
A full outer join will create additional rows whenever all three values are not exactly the same. It can break the ids, so be careful.
val dfFullOuter =
  table1
    .join( table2, Seq( "id", "From Date", "To Date" ), "outer" )
Result
+------+--------------------+--------------------+----+--------------------+
| id| From Date| To Date|User| Associated id|
+------+--------------------+--------------------+----+--------------------+
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B| null|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G| null|
|AA1114|24-Jan-2017 12:00...|31-Dec-3030 11:59...|null|[AA12345, AA23456...|
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R| [AA12345]|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|null| [AA12345]|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|null| [AA12345]|
+------+--------------------+--------------------+----+--------------------+
You can see that the rows for id AA1111 were merged successfully because there were no conflicting values. The other rows were simply duplicated. This approach is only recommended if you are absolutely sure that the values in the To Date and From Date columns are identical for rows with the same id.
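One way to be sure is a quick sanity check before relying on the three-column join; a sketch (mismatches is a hypothetical name):

// Rows whose id matches but whose date ranges disagree between the two tables.
val mismatches =
  table1
    .join( table2, Seq( "id" ) )
    .where(
      table1( "From Date" ) =!= table2( "From Date" ) ||
      table1( "To Date" ) =!= table2( "To Date" )
    )

// An empty result means the three-column outer join above merges cleanly.
mismatches.show()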
You can also merge by id only and then decide which table's columns take priority. In this example, priority is given to Table 2:
val dfFullOuterManual =
  table1
    .join( table2, Seq( "id" ), "outer" )
    .drop( table1( "From Date" ) )
    .drop( table1( "To Date" ) )
Result
+------+----+--------------------+--------------------+--------------------+
| id|User| From Date| To Date| Associated id|
+------+----+--------------------+--------------------+--------------------+
|AA1112|AT3B|10-Jan-2017 12:00...|14-Feb-2017 11:59...| [AA12345]|
|AA1111|LL7R|02-Jan-2017 12:00...|08-Jan-2017 11:59...| [AA12345]|
|AA1114|null|24-Jan-2017 12:00...|31-Dec-3030 11:59...|[AA12345, AA23456...|
|AA1113|UJ5G|16-Feb-2017 12:00...|30-Dec-3030 11:59...| [AA12345]|
+------+----+--------------------+--------------------+--------------------+
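If you would rather keep Table 2's dates only where they exist and fall back to Table 1's otherwise, coalesce is an option instead of dropping the columns outright. A sketch, assuming import org.apache.spark.sql.functions._ as in the test bench below (dfCoalesced is a hypothetical name):

// Prefer Table 2's dates, falling back to Table 1's when Table 2 has no row for the id.
val dfCoalesced =
  table1
    .join( table2, Seq( "id" ), "outer" )
    .select(
      col( "id" ),
      coalesce( table2( "From Date" ), table1( "From Date" ) ).as( "From Date" ),
      coalesce( table2( "To Date" ), table1( "To Date" ) ).as( "To Date" ),
      col( "User" ),
      col( "Associated id" )
    )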
A left outer join gives priority to the values from Table 1; even if there is only one conflict, it will use all the values from that table. Note that the Associated id values of the conflicting rows are null because Table 1 has no such column. Also, the row for id AA1114 is not carried over at all.
val dfLeftOuter =
  table1
    .join( table2, Seq( "id", "From Date", "To Date" ), "left_outer" )
Result
+------+--------------------+--------------------+----+-------------+
| id| From Date| To Date|User|Associated id|
+------+--------------------+--------------------+----+-------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R| [AA12345]|
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B| null|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G| null|
+------+--------------------+--------------------+----+-------------+
We have resolved the conflicts in the From Date and To Date columns; now it is time to fill in the missing Associated id values. To do that, we join the previous result with selected values from Table 2:
val dfLeftOuterFinal =
  dfLeftOuter
    .join( table2.select( "id", "Associated id" ), Seq( "id" ) )
    .drop( dfLeftOuter( "Associated id" ) )
Note that dropping the original Associated id column is necessary because it came from the left outer join and is mostly null.
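One caveat: with no join type given, this second join defaults to an inner join, which silently drops any id from dfLeftOuter that has no match in Table 2. That cannot happen with this data, since every id in Table 1 also appears in Table 2, but passing "left" makes the intent explicit; a sketch (dfLeftOuterFinalSafe is a hypothetical name):

// Same backfill, but unmatched ids would be kept with a null Associated id.
val dfLeftOuterFinalSafe =
  dfLeftOuter
    .join( table2.select( "id", "Associated id" ), Seq( "id" ), "left" )
    .drop( dfLeftOuter( "Associated id" ) )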
Final result
+------+--------------------+--------------------+----+-------------+
| id| From Date| To Date|User|Associated id|
+------+--------------------+--------------------+----+-------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R| [AA12345]|
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B| [AA12345]|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G| [AA12345]|
+------+--------------------+--------------------+----+-------------+
A right outer join gives priority to the data in Table 2 and adds the completely different row (AA1114) to the result table. Note that the User values of the conflicting rows are null because Table 2 has no such column.
val dfRightOuter =
  table1
    .join( table2, Seq( "id", "From Date", "To Date" ), "right_outer" )
Result
+------+--------------------+--------------------+----+--------------------+
| id| From Date| To Date|User| Associated id|
+------+--------------------+--------------------+----+--------------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R| [AA12345]|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|null| [AA12345]|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|null| [AA12345]|
|AA1114|24-Jan-2017 12:00...|31-Dec-3030 11:59...|null|[AA12345, AA23456...|
+------+--------------------+--------------------+----+--------------------+
As with the left outer join, we have to retrieve the missing values. This time it is User:
val dfRightOuterFinal =
  dfRightOuter
    .join( table1.select( "id", "User" ), Seq( "id" ) )
    .drop( dfRightOuter( "User" ) )
Final result
+------+--------------------+--------------------+-------------+----+
| id| From Date| To Date|Associated id|User|
+------+--------------------+--------------------+-------------+----+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...| [AA12345]|LL7R|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...| [AA12345]|AT3B|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...| [AA12345]|UJ5G|
+------+--------------------+--------------------+-------------+----+
Note that the row for id AA1114 has disappeared because it has no User value.
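That disappearance is again the inner-join default at work. If you would rather keep AA1114 with a null User, a left join does it; a sketch (dfRightOuterKeepAll is a hypothetical name):

// Same backfill, but AA1114 survives with a null User.
val dfRightOuterKeepAll =
  dfRightOuter
    .join( table1.select( "id", "User" ), Seq( "id" ), "left" )
    .drop( dfRightOuter( "User" ) )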
Depending on which data takes priority, you can play with these combinations for other columns as well. As you can see, these types of joins can also be used to handle gaps in your data, according to your intent.
My full test bench code:
import org.apache.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Main {

  def main( args: Array[ String ] ): Unit = {

    // Local Spark session for the test bench.
    val spark =
      SparkSession
        .builder()
        .appName( "SO" )
        .master( "local[*]" )
        .config( "spark.driver.host", "localhost" )
        .getOrCreate()

    import spark.implicits._

    val table1Data = Seq(
      ( "AA1111", "02-Jan-2017 12:00:00 AM", "08-Jan-2017 11:59:59 PM", "LL7R" ),
      ( "AA1112", "09-Jan-2017 12:00:00 AM", "14-Feb-2017 11:59:59 PM", "AT3B" ),
      ( "AA1113", "15-Feb-2017 12:00:00 AM", "31-Dec-3030 11:59:59 PM", "UJ5G" )
    )

    val table1 =
      table1Data
        .toDF( "id", "From Date", "To Date", "User" )

    val table2Data = Seq(
      ( "AA1111", "02-Jan-2017 12:00:00 AM", "08-Jan-2017 11:59:59 PM", Seq( "AA12345" ) ),
      ( "AA1112", "10-Jan-2017 12:00:00 AM", "14-Feb-2017 11:59:59 PM", Seq( "AA12345" ) ),
      ( "AA1113", "16-Feb-2017 12:00:00 AM", "30-Dec-3030 11:59:59 PM", Seq( "AA12345" ) ),
      ( "AA1114", "24-Jan-2017 12:00:00 AM", "31-Dec-3030 11:59:59 PM", Seq( "AA12345", "AA234567", "AB56789" ) )
    )

    val table2 =
      table2Data
        .toDF( "id", "From Date", "To Date", "Associated id" )

    // Full outer join on all three columns.
    val dfFullOuter =
      table1
        .join( table2, Seq( "id", "From Date", "To Date" ), "outer" )

    // Full outer join on id only, keeping Table 2's date columns.
    val dfFullOuterManual =
      table1
        .join( table2, Seq( "id" ), "outer" )
        .drop( table1( "From Date" ) )
        .drop( table1( "To Date" ) )

    // Left outer join: Table 1 takes priority.
    val dfLeftOuter =
      table1
        .join( table2, Seq( "id", "From Date", "To Date" ), "left_outer" )

    // Backfill the missing Associated id values from Table 2.
    val dfLeftOuterFinal =
      dfLeftOuter
        .join( table2.select( "id", "Associated id" ), Seq( "id" ) )
        .drop( dfLeftOuter( "Associated id" ) )

    // Right outer join: Table 2 takes priority.
    val dfRightOuter =
      table1
        .join( table2, Seq( "id", "From Date", "To Date" ), "right_outer" )

    // Backfill the missing User values from Table 1.
    val dfRightOuterFinal =
      dfRightOuter
        .join( table1.select( "id", "User" ), Seq( "id" ) )
        .drop( dfRightOuter( "User" ) )

    // Display the results shown above.
    dfFullOuter.show()
    dfFullOuterManual.show()
    dfLeftOuter.show()
    dfLeftOuterFinal.show()
    dfRightOuter.show()
    dfRightOuterFinal.show()

    spark.stop()
  }
}