Spark outer join source

Time: 2017-12-06 22:10:50

Tags: python apache-spark pyspark apache-spark-sql

I am relatively new to Spark, and I would like to know whether I can find out which DataFrame a key came from after an outer join.

Say I have 3 DataFrames:

DF1

+-----+----+
|item1| key|
+-----+----+
|Item1|key1|
|Item2|key2|
|Item3|key3|
|Item4|key4|
|Item5|key5|
+-----+----+

DF2

+-----+----+
|item2| key|
+-----+----+
|   t1|key1|
|   t2|key2|
|   t3|key6|
|   t4|key7|
|   t5|key8|
+-----+----+

DF3

+-----+-----+
|item3|  key|
+-----+-----+
|   t1| key1|
|   t2| key2|
|   t3| key8|
|   t4| key9|
|   t5|key10|
+-----+-----+

I want to do a full outer join on these 3 DataFrames and include a new column that indicates where each key came from.

E.g.

+-----+-----+-----+-----+------+
|  key|item1|item2|item3|source|
+-----+-----+-----+-----+------+
| key8| null|   t5|   t3|   DF2|
| key5|Item5| null| null|   DF1|
| key7| null|   t4| null|   DF2|
| key3|Item3| null| null|   DF1|
| key6| null|   t3| null|   DF2|
| key1|Item1|   t1|   t1|   DF1|
| key4|Item4| null| null|   DF1|
| key2|Item2|   t2|   t2|   DF1|
| key9| null| null|   t4|   DF3|
|key10| null| null|   t5|   DF3|
+-----+-----+-----+-----+------+

Is there a way to achieve this?

1 Answer:

Answer 0: (score: 0)

I would do something like this:

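from pyspark.sql.functions import col, coalesce, when

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
# Sample data matching DF1, DF2 and DF3 from the question.
df1 = spark.createDataFrame(
    [("Item1", "key1"), ("Item2", "key2"), ("Item3", "key3"),
     ("Item4", "key4"), ("Item5", "key5")],
    ["item1", "key"])

df2 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key6"),
     ("t4", "key7"), ("t5", "key8")],
    ["item2", "key"])

df3 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key8"),
     ("t4", "key9"), ("t5", "key10")],
    ["item3", "key"])

# Full outer join on "key", then label each row with the first DataFrame
# whose item column is non-null. when() without otherwise() yields null,
# so coalesce() falls through to the next candidate.
result = df1.join(df2, ["key"], "outer").join(df3, ["key"], "outer").withColumn(
    "source",
    coalesce(
        when(col("item1").isNotNull(), "df1"),
        when(col("item2").isNotNull(), "df2"),
        when(col("item3").isNotNull(), "df3")))

result.show()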

The result is:

## +-----+-----+-----+-----+------+                      
## |  key|item1|item2|item3|source|
## +-----+-----+-----+-----+------+
## | key8| null|   t5|   t3|   df2|
## | key5|Item5| null| null|   df1|
## | key7| null|   t4| null|   df2|
## | key3|Item3| null| null|   df1|
## | key6| null|   t3| null|   df2|
## | key1|Item1|   t1|   t1|   df1|
## | key4|Item4| null| null|   df1|
## | key2|Item2|   t2|   t2|   df1|
## | key9| null| null|   t4|   df3|
## |key10| null| null|   t5|   df3|
## +-----+-----+-----+-----+------+
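To see why key8 is labelled df2 rather than df3, it may help to model the same logic without Spark. The sketch below is plain Python, not PySpark: the local `coalesce` and `when` functions are simplified stand-ins for the Spark functions of the same name, and `full_outer_join_with_source` is a hypothetical helper introduced only for illustration.

```python
# Plain-Python model of the answer's logic; no Spark needed.
# Each dict maps key -> item value, mirroring df1, df2 and df3 above.
df1 = {"key1": "Item1", "key2": "Item2", "key3": "Item3",
       "key4": "Item4", "key5": "Item5"}
df2 = {"key1": "t1", "key2": "t2", "key6": "t3", "key7": "t4", "key8": "t5"}
df3 = {"key1": "t1", "key2": "t2", "key8": "t3", "key9": "t4", "key10": "t5"}


def coalesce(*values):
    """Return the first value that is not None (stand-in for Spark's coalesce)."""
    return next((v for v in values if v is not None), None)


def when(condition, value):
    """Return value if condition holds, else None (when() without otherwise())."""
    return value if condition else None


def full_outer_join_with_source(frames):
    """Hypothetical helper. frames: dict of name -> {key: item}, in priority order."""
    all_keys = set().union(*frames.values())  # every key seen in any frame
    rows = {}
    for key in all_keys:
        # A missing key becomes None, like the nulls a full outer join produces.
        items = {name: frame.get(key) for name, frame in frames.items()}
        source = coalesce(
            *(when(item is not None, name) for name, item in items.items()))
        rows[key] = (items, source)
    return rows


rows = full_outer_join_with_source({"df1": df1, "df2": df2, "df3": df3})
for key in sorted(rows):
    print(key, rows[key])
```

Because the candidates are checked in order, a key present in several DataFrames is attributed to the first one listed. Note also that the approach relies on the item columns never being null in their own DataFrame: a row whose item value is itself null would fall through to the next candidate, or end up with a null source. One possible way around that in Spark is to add a constant marker column to each DataFrame with lit() before joining and to coalesce over those markers instead.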