I'm relatively new to Spark, and I'd like to know whether I can work out which DataFrame a key originated from when doing an outer join.
Say I have 3 DataFrames:
DF1
+-----+----+
|item1| key|
+-----+----+
|Item1|key1|
|Item2|key2|
|Item3|key3|
|Item4|key4|
|Item5|key5|
+-----+----+
DF2
+-----+----+
|item2| key|
+-----+----+
| t1|key1|
| t2|key2|
| t3|key6|
| t4|key7|
| t5|key8|
+-----+----+
DF3
+-----+-----+
|item3| key|
+-----+-----+
| t1| key1|
| t2| key2|
| t3| key8|
| t4| key9|
| t5|key10|
+-----+-----+
I want to do a full outer join on these 3 DataFrames and include a new column indicating which DataFrame the key came from, e.g.:
+-----+-----+-----+-----+------+
| key|item1|item2|item3|source|
+-----+-----+-----+-----+------+
| key8| null| t5| t3| DF2|
| key5|Item5| null| null| DF1|
| key7| null| t4| null| DF2|
| key3|Item3| null| null| DF1|
| key6| null| t3| null| DF2|
| key1|Item1| t1| t1| DF1|
| key4|Item4| null| null| DF1|
| key2|Item2| t2| t2| DF1|
| key9| null| null| t4| DF3|
|key10| null| null| t5| DF3|
+-----+-----+-----+-----+------+
Is there a way to achieve this?
Answer 0 (score: 0)
I would do something like this:
from pyspark.sql.functions import col, lit, coalesce, when

df1 = spark.createDataFrame(
    [("Item1", "key1"), ("Item2", "key2"), ("Item3", "key3"),
     ("Item4", "key4"), ("Item5", "key5")],
    ["item1", "key"])

df2 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key6"),
     ("t4", "key7"), ("t5", "key8")],
    ["item2", "key"])

df3 = spark.createDataFrame(
    [("t1", "key1"), ("t2", "key2"), ("t3", "key8"),
     ("t4", "key9"), ("t5", "key10")],
    ["item3", "key"])

# Chain two full outer joins on "key", then derive the source label.
df1.join(df2, ["key"], "outer").join(df3, ["key"], "outer").withColumn(
    "source",
    # Each when(...) without an otherwise is null when its condition is
    # false, so coalesce picks the label of the first DataFrame that
    # actually contains the key.
    coalesce(
        when(col("item1").isNotNull(), lit("df1")),
        when(col("item2").isNotNull(), lit("df2")),
        when(col("item3").isNotNull(), lit("df3"))))
The result is:
## +-----+-----+-----+-----+------+
## | key|item1|item2|item3|source|
## +-----+-----+-----+-----+------+
## | key8| null| t5| t3| df2|
## | key5|Item5| null| null| df1|
## | key7| null| t4| null| df2|
## | key3|Item3| null| null| df1|
## | key6| null| t3| null| df2|
## | key1|Item1| t1| t1| df1|
## | key4|Item4| null| null| df1|
## | key2|Item2| t2| t2| df1|
## | key9| null| null| t4| df3|
## |key10| null| null| t5| df3|
## +-----+-----+-----+-----+------+
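Not part of the original answer, but if the number of DataFrames isn't fixed, the same join-then-coalesce pattern can be generalized with functools.reduce. A minimal sketch, assuming df1, df2, and df3 are defined as above; the frames list and its triples are my own naming for illustration:

from functools import reduce
from pyspark.sql.functions import coalesce, col, lit, when

# (label, DataFrame, item column) for each input, in priority order.
frames = [("df1", df1, "item1"), ("df2", df2, "item2"), ("df3", df3, "item3")]

# Chain full outer joins on "key" across all DataFrames.
joined = reduce(lambda left, right: left.join(right, ["key"], "outer"),
                [df for _, df, _ in frames])

# Label each row with the first DataFrame whose item column is non-null.
source = coalesce(*[when(col(item).isNotNull(), lit(name))
                    for name, _, item in frames])

joined.withColumn("source", source).show()

The order of the entries in frames determines precedence, matching the df1 > df2 > df3 ordering of the original coalesce.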