val df1 = sc.parallelize(Seq(
  ("a1",10,"ACTIVE","ds1"),
  ("a1",20,"ACTIVE","ds1"),
  ("a2",50,"ACTIVE","ds1"),
  ("a3",60,"ACTIVE","ds1"))
).toDF("c1","c2","c3","c4")
val df2 = sc.parallelize(Seq(
  ("a1",10,"ACTIVE","ds2"),
  ("a1",20,"ACTIVE","ds2"),
  ("a1",30,"ACTIVE","ds2"),
  ("a1",40,"ACTIVE","ds2"),
  ("a4",20,"ACTIVE","ds2"))
).toDF("c1","c2","c3","c5")
df1.show()
// +---+---+------+---+
// | c1| c2| c3| c4|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds1|
// | a1| 20|ACTIVE|ds1|
// | a2| 50|ACTIVE|ds1|
// | a3| 60|ACTIVE|ds1|
// +---+---+------+---+
df2.show()
// +---+---+------+---+
// | c1| c2| c3| c5|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds2|
// | a1| 20|ACTIVE|ds2|
// | a1| 30|ACTIVE|ds2|
// | a1| 40|ACTIVE|ds2|
// | a4| 20|ACTIVE|ds2|
// +---+---+------+---+
My requirement is this: I need to join the two dataframes. The output dataframe should contain all records from df1, plus those records from df2 that are not in df1, but only for "c1" values that appear in both. The records pulled in from df2 should have their "c3" column set to INACTIVE.
In this example the only matching "c1" value is a1, so I need to pull the c2 = 30 and 40 records from df2 and mark them INACTIVE.
This is the expected output:
df_output.show()
// +---+---+--------+---+
// | c1| c2|      c3| c4|
// +---+---+--------+---+
// | a1| 10|  ACTIVE|ds1|
// | a1| 20|  ACTIVE|ds1|
// | a2| 50|  ACTIVE|ds1|
// | a3| 60|  ACTIVE|ds1|
// | a1| 30|INACTIVE|ds1|
// | a1| 40|INACTIVE|ds1|
// +---+---+--------+---+
Can anyone help me do this?
Answer 0 (score: 1)
First, one small thing: I used different names for the columns in df2:
val df2 = sc.parallelize(...).toDF("d1","d2","d3","d4")
Nothing major, but it makes things easier for me to follow.
Now for the interesting part. For clarity's sake, I'll be a bit verbose:
val join = df1
  .join(df2, df1("c1") === df2("d1"), "inner")
  .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
  .dropDuplicates
Here's what I'm doing: an inner join between df1 and df2 on the c1 and d1 columns, then selecting the df2 columns while replacing ds2 with the hard-coded "ds1". This basically just filters out everything in df2 that doesn't have a corresponding key in df1's c1 column.
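For reference, with the sample data above this intermediate result should work out to the following (traced by hand from the inputs, not re-run, and row order is not guaranteed):
join.show()
// +---+---+------+---+
// | d1| d2|    d3| d4|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds1|
// | a1| 20|ACTIVE|ds1|
// | a1| 30|ACTIVE|ds1|
// | a1| 40|ACTIVE|ds1|
// +---+---+------+---+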
Next I take the difference:
val diff = join
  .except(df1)
  .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
This is a basic set operation that finds everything in join that is not in df1. These are the items to deactivate, so I select all the columns but replace the third with a hard-coded INACTIVE.
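Again for reference, on the sample data this should leave just the two rows to deactivate (traced by hand, not re-run):
diff.show()
// +---+---+--------+---+
// | d1| d2|      d3| d4|
// +---+---+--------+---+
// | a1| 30|INACTIVE|ds1|
// | a1| 40|INACTIVE|ds1|
// +---+---+--------+---+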
All that's left is to put them together:
df1.union(diff)
This simply combines df1 with the table of deactivated values we computed earlier, producing the final result:
+---+---+--------+---+
| c1| c2| c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
+---+---+--------+---+
Again, you don't need all of these intermediate values; I was just being verbose so the whole process is easy to follow.
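If you'd rather skip the named intermediates, the same logic collapses into a single expression. Here is a sketch under the same column names (not re-run):
val df_output = df1.union(
  df1.join(df2, df1("c1") === df2("d1"), "inner")
    .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
    .dropDuplicates
    .except(df1)
    .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
)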
Answer 1 (score: 0)
Here is a quick and dirty solution -
from pyspark.sql import functions as F
# find the rows from df2 that have a matching key c1 in df1
df3 = df1.join(df2, df1.c1 == df2.c1)\
    .select(df2.c1, df2.c2, df2.c3, df2.c5.alias('c4'))\
    .dropDuplicates()
df3.show()
+---+---+------+---+
| c1| c2| c3| c4|
+---+---+------+---+
| a1| 10|ACTIVE|ds2|
| a1| 20|ACTIVE|ds2|
| a1| 30|ACTIVE|ds2|
| a1| 40|ACTIVE|ds2|
+---+---+------+---+
# Union df3 with df1, drop duplicate (c1, c2) keys, and rewrite c3/c4
# wherever c4 is 'ds2'.
# NB: this relies on dropDuplicates keeping df1's row when a key exists in
# both dataframes, which Spark does not strictly guarantee.
df1.union(df3).dropDuplicates(['c1', 'c2'])\
    .select('c1', 'c2',
            F.when(df1.c4 == 'ds2', 'INACTIVE').otherwise('ACTIVE').alias('c3'),
            F.lit('ds1').alias('c4')  # c4 is always 'ds1' in the output
           )\
    .orderBy('c1', 'c2')\
    .show()
+---+---+--------+---+
| c1| c2| c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
+---+---+--------+---+
Answer 2 (score: 0)
I enjoyed the challenge; here is my solution.
// the distinct set of join keys present in df1
val c1keys = df1.select("c1").distinct
// keep only the df2 rows whose c1 also appears in df1
val df2_in_df1 = df2.join(c1keys, Seq("c1"), "inner")
// of those, keep the (c1, c2) pairs df1 lacks, and mark them INACTIVE
val df2inactive = df2_in_df1.join(df1, Seq("c1", "c2"), "leftanti").withColumn("c3", lit("INACTIVE"))
scala> df1.union(df2inactive).show
+---+---+--------+---+
| c1| c2| c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds2|
| a1| 40|INACTIVE|ds2|
+---+---+--------+---+