Joining multiple tables into one denormalized table

Date: 2018-01-25 16:22:48

Tags: python scala apache-spark

I am looking to join multiple tables to get a single denormalized result table. Below is one such scenario: I have two tables and an expected result table.

Table 1:

(table not recoverable from the original post)

Table 2:

(table not recoverable from the original post)

Note that the id values are the same in both tables. Think of these as event tables, where different events occur over different time periods. The final result table should therefore contain all events, with no overlap between the From and To dates. If there is an overlap between 'From Date' and 'To Date', as in this example (the 'To Date' of record 1 in Table 1 is greater than the 'From Date' of record 1 in Table 2), then the 'To Date' in the result table is updated to the next nearest date minus 1 second (in this case, 06-Jan-2017 12:00:00 AM minus 1 second).

Result:

id       From Date                  To Date                  User
AA12345  02-Jan-2017 12:00:00 AM    08-Jan-2017 11:59:59 PM  LL7R
AA12345  09-Jan-2017 12:00:00 AM    14-Feb-2017 11:59:59 PM  AT3B
AA12345  15-Feb-2017 12:00:00 AM    31-Dec-3030 11:59:59 PM  UJ5G

How can we achieve this efficiently?

1 Answer:

Answer 0: (score: 4)

So what you want is an outer join. When the values in the join columns don't match, there are four variants of this operation, depending on which table takes priority.

In this example we have two tables:

Table 1

+------+--------------------+--------------------+----+
|    id|           From Date|             To Date|User|
+------+--------------------+--------------------+----+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G|
+------+--------------------+--------------------+----+

Table 2

+------+--------------------+--------------------+--------------------+
|    id|           From Date|             To Date|       Associated id|
+------+--------------------+--------------------+--------------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|           [AA12345]|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|           [AA12345]|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|           [AA12345]|
|AA1114|24-Jan-2017 12:00...|31-Dec-3030 11:59...|[AA12345, AA23456...|
+------+--------------------+--------------------+--------------------+

Note that the first row in Table 2 not only has the same id as the first row in Table 1, but also the same From Date and To Date values. The second row, on the other hand, has the same id and To Date but a different From Date. The third row shares only the id, and the fourth row is completely different. For simplicity, let's assume this combination covers all the variations in the data.

Now for the different types of joins.

Full outer join

A full outer join simply creates an additional row whenever all three values are not exactly the same. It duplicates the ids, so be careful with it.

val dfFullOuter =
    table1
    .join( table2, Seq( "id", "From Date", "To Date" ), "outer" )

Result

+------+--------------------+--------------------+----+--------------------+
|    id|           From Date|             To Date|User|       Associated id|
+------+--------------------+--------------------+----+--------------------+
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B|                null|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G|                null|
|AA1114|24-Jan-2017 12:00...|31-Dec-3030 11:59...|null|[AA12345, AA23456...|
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|           [AA12345]|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|null|           [AA12345]|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|null|           [AA12345]|
+------+--------------------+--------------------+----+--------------------+

You can see that the rows with id AA1111 were merged successfully, because there are no conflicting values; the other rows were simply duplicated. This approach is advisable only if you are absolutely sure that the To Date and From Date values are the same for rows with the same id.
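
One quick way to spot such duplicated ids is to count the rows per id (col comes from org.apache.spark.sql.functions, which the test bench below imports):

dfFullOuter
    .groupBy( "id" )
    .count()
    .filter( col( "count" ) > 1 )    // ids with more than one row were duplicated by the join
    .show()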

You can also join on id alone and then decide which table's date columns take priority. In this example, priority is given to Table 2:

val dfFullOuterManual =
    table1
    .join( table2, Seq( "id" ), "outer" )
    .drop( table1( "From Date" ) )
    .drop( table1( "To Date" ) )

Result

+------+----+--------------------+--------------------+--------------------+
|    id|User|           From Date|             To Date|       Associated id|
+------+----+--------------------+--------------------+--------------------+
|AA1112|AT3B|10-Jan-2017 12:00...|14-Feb-2017 11:59...|           [AA12345]|
|AA1111|LL7R|02-Jan-2017 12:00...|08-Jan-2017 11:59...|           [AA12345]|
|AA1114|null|24-Jan-2017 12:00...|31-Dec-3030 11:59...|[AA12345, AA23456...|
|AA1113|UJ5G|16-Feb-2017 12:00...|30-Dec-3030 11:59...|           [AA12345]|
+------+----+--------------------+--------------------+--------------------+

Left outer join

A left outer join gives priority to the values from Table 1: even if only one of the join columns conflicts, all of Table 1's values are kept. Note that Associated id is null for the conflicting rows, because Table 1 has no such column. Also, the row with id AA1114 is not carried over into the result.

val dfLeftOuter =
    table1
    .join( table2, Seq( "id", "From Date", "To Date" ), "left_outer" )

Result

+------+--------------------+--------------------+----+-------------+
|    id|           From Date|             To Date|User|Associated id|
+------+--------------------+--------------------+----+-------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|    [AA12345]|
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B|         null|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G|         null|
+------+--------------------+--------------------+----+-------------+

We have resolved the conflicts in the From Date and To Date columns; now it is time to fill in the missing Associated id values. To do that, we join the previous result with the relevant columns of Table 2:

val dfLeftOuterFinal =
    dfLeftOuter
    .join( table2.select( "id", "Associated id" ) , Seq( "id" ) )
    .drop( dfLeftOuter( "Associated id" ) )

Note that dropping the original Associated id column is necessary: it comes from the left outer join and is mostly null, and without the drop the result would contain two columns with the same name.

Final result

+------+--------------------+--------------------+----+-------------+
|    id|           From Date|             To Date|User|Associated id|
+------+--------------------+--------------------+----+-------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|    [AA12345]|
|AA1112|09-Jan-2017 12:00...|14-Feb-2017 11:59...|AT3B|    [AA12345]|
|AA1113|15-Feb-2017 12:00...|31-Dec-3030 11:59...|UJ5G|    [AA12345]|
+------+--------------------+--------------------+----+-------------+

Right outer join

A right outer join gives priority to the data in Table 2 and adds the completely different row (AA1114) to the result table. Note that User is null for the conflicting rows, because Table 2 has no such column.

val dfRightOuter =
    table1
    .join( table2, Seq( "id", "From Date", "To Date" ), "right_outer" )

Result

+------+--------------------+--------------------+----+--------------------+
|    id|           From Date|             To Date|User|       Associated id|
+------+--------------------+--------------------+----+--------------------+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|LL7R|           [AA12345]|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|null|           [AA12345]|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|null|           [AA12345]|
|AA1114|24-Jan-2017 12:00...|31-Dec-3030 11:59...|null|[AA12345, AA23456...|
+------+--------------------+--------------------+----+--------------------+

As with the left outer join, we have to retrieve the missing values. This time it is User:

val dfRightOuterFinal =
    dfRightOuter
    .join( table1.select( "id", "User" ) , Seq( "id" ) )
    .drop( dfRightOuter( "User" ) )

Final result

+------+--------------------+--------------------+-------------+----+
|    id|           From Date|             To Date|Associated id|User|
+------+--------------------+--------------------+-------------+----+
|AA1111|02-Jan-2017 12:00...|08-Jan-2017 11:59...|    [AA12345]|LL7R|
|AA1112|10-Jan-2017 12:00...|14-Feb-2017 11:59...|    [AA12345]|AT3B|
|AA1113|16-Feb-2017 12:00...|30-Dec-3030 11:59...|    [AA12345]|UJ5G|
+------+--------------------+--------------------+-------------+----+

Note that the row with id AA1114 has disappeared: the second join on id is an inner join by default, and Table 1 has no User value for that id.
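
If you would rather keep the AA1114 row with a null User, a small variation is to make this second join a left outer join as well (the dfRightOuterKeepAll name is just for illustration):

val dfRightOuterKeepAll =
    dfRightOuter
    .join( table1.select( "id", "User" ), Seq( "id" ), "left_outer" )    // keep rows with no match in table1
    .drop( dfRightOuter( "User" ) )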

Final thoughts

Depending on which data takes priority, you can play with these combinations for the other columns as well. As you can see, these kinds of joins can also be used to fill gaps in your data, according to your intent.
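
The joins above still leave one part of the question open: trimming overlapping periods so that each event ends one second before the next one begins. Below is a minimal sketch of that step using a window function. It assumes the merged events sit in a single DataFrame called events with the same columns and date format as the sample data; events, fmt and the *_ts columns are illustrative names, and the functions come from org.apache.spark.sql.functions as imported in the test bench:

import org.apache.spark.sql.expressions.Window

// Parse the string dates (format assumed from the sample data).
val fmt = "dd-MMM-yyyy hh:mm:ss a"
val withTs = events
    .withColumn( "from_ts", to_timestamp( col( "From Date" ), fmt ) )
    .withColumn( "to_ts", to_timestamp( col( "To Date" ), fmt ) )

// For each id, look ahead to the next event's start; if the current event
// runs past it, end this event one second before the next one begins.
val byStart = Window.partitionBy( "id" ).orderBy( "from_ts" )
val trimmed = withTs
    .withColumn( "next_from", lead( col( "from_ts" ), 1 ).over( byStart ) )
    .withColumn( "to_ts",
        when( col( "next_from" ).isNotNull && col( "to_ts" ) >= col( "next_from" ),
            col( "next_from" ) - expr( "INTERVAL 1 SECOND" ) )
        .otherwise( col( "to_ts" ) ) )
    .drop( "next_from" )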

My complete test bench code:

import org.apache.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Main {

    def main( args: Array[ String ] ): Unit = {

        val spark =
            SparkSession
            .builder()
            .appName( "SO" )
            .master( "local[*]" )
            .config( "spark.driver.host", "localhost" )
            .getOrCreate()

        import spark.implicits._

        val table1Data = Seq(
            ( "AA1111", "02-Jan-2017 12:00:00 AM", "08-Jan-2017 11:59:59 PM", "LL7R" ),
            ( "AA1112", "09-Jan-2017 12:00:00 AM", "14-Feb-2017 11:59:59 PM", "AT3B" ),
            ( "AA1113", "15-Feb-2017 12:00:00 AM", "31-Dec-3030 11:59:59 PM", "UJ5G" )
        )

        val table1 =
            table1Data
            .toDF( "id", "From Date", "To Date", "User" )

        val table2Data = Seq(
            ( "AA1111", "02-Jan-2017 12:00:00 AM", "08-Jan-2017 11:59:59 PM", Seq( "AA12345" ) ),
            ( "AA1112", "10-Jan-2017 12:00:00 AM", "14-Feb-2017 11:59:59 PM", Seq( "AA12345" ) ),
            ( "AA1113", "16-Feb-2017 12:00:00 AM", "30-Dec-3030 11:59:59 PM", Seq( "AA12345" ) ),
            ( "AA1114", "24-Jan-2017 12:00:00 AM", "31-Dec-3030 11:59:59 PM", Seq( "AA12345", "AA234567", "AB56789" ) )
        )

        val table2 =
            table2Data
            .toDF( "id", "From Date", "To Date", "Associated id" )

        val dfFullOuter =
            table1
            .join( table2, Seq( "id", "From Date", "To Date" ), "outer" )

        val dfFullOuterManual = 
            table1
            .join( table2, Seq( "id" ), "outer" )
            .drop( table1( "From Date" ) )
            .drop( table1( "To Date" ) )

        val dfLeftOuter =
            table1
            .join( table2, Seq( "id", "From Date", "To Date" ), "left_outer" )

        val dfLeftOuterFinal =
            dfLeftOuter
            .join( table2.select( "id", "Associated id" ) , Seq( "id" ) )
            .drop( dfLeftOuter( "Associated id" ) )

        val dfRightOuter =
            table1
            .join( table2, Seq( "id", "From Date", "To Date" ), "right_outer" )

        val dfRightOuterFinal =
            dfRightOuter
            .join( table1.select( "id", "User" ) , Seq( "id" ) )
            .drop( dfRightOuter( "User" ) )
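
        // Print each variant; show() truncates long values, which matches
        // the tables shown above (these calls are an addition to the listing).
        dfFullOuter.show()
        dfFullOuterManual.show()
        dfLeftOuter.show()
        dfLeftOuterFinal.show()
        dfRightOuter.show()
        dfRightOuterFinal.show()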

        spark.stop()
    }
}