How to join two DataFrames and change the column for the missing values?

Asked: 2017-04-18 21:30:39

Tags: scala apache-spark apache-spark-sql

val df1 = sc.parallelize(Seq(
   ("a1",10,"ACTIVE","ds1"),
   ("a1",20,"ACTIVE","ds1"),
   ("a2",50,"ACTIVE","ds1"),
   ("a3",60,"ACTIVE","ds1"))
).toDF("c1","c2","c3","c4")`

val df2 = sc.parallelize(Seq(
   ("a1",10,"ACTIVE","ds2"),
   ("a1",20,"ACTIVE","ds2"),
   ("a1",30,"ACTIVE","ds2"),
   ("a1",40,"ACTIVE","ds2"),
   ("a4",20,"ACTIVE","ds2"))
).toDF("c1","c2","c3","c5")`


df1.show()

// +---+---+------+---+
// | c1| c2|    c3| c4|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds1|
// | a1| 20|ACTIVE|ds1|
// | a2| 50|ACTIVE|ds1|
// | a3| 60|ACTIVE|ds1|
// +---+---+------+---+

df2.show()
// +---+---+------+---+
// | c1| c2|    c3| c5|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds2|
// | a1| 20|ACTIVE|ds2|
// | a1| 30|ACTIVE|ds2|
// | a1| 40|ACTIVE|ds2|
// | a4| 20|ACTIVE|ds2|
// +---+---+------+---+

My requirement is: I need to join the two DataFrames. My output DataFrame should contain all records from df1, plus those records from df2 that are not in df1, but only for the matching "c1" values. The records I pull in from df2 should be updated to INACTIVE in the "c3" column.

In this example, the only matching "c1" value is a1, so I need to pull the c2 = 30 and 40 records from df2 and make them INACTIVE.

Here is the expected output:

df_output.show()

// +---+---+--------+---+
// | c1| c2|    c3  | c4|
// +---+---+--------+---+
// | a1| 10|ACTIVE  |ds1|
// | a1| 20|ACTIVE  |ds1|
// | a2| 50|ACTIVE  |ds1|
// | a3| 60|ACTIVE  |ds1|
// | a1| 30|INACTIVE|ds1|
// | a1| 40|INACTIVE|ds1|
// +---+---+--------+---+

Can anyone help me do this?

3 Answers:

Answer 0 (score: 1)

First, a small thing. I used different names for the columns in df2:

val df2 = sc.parallelize(...).toDF("d1","d2","d3","d4")

Nothing major, but it makes things easier for me to follow.

Now for the interesting part. For clarity, I'll be a bit verbose:

val join = df1
.join(df2, df1("c1") === df2("d1"), "inner")
.select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
.dropDuplicates

Here is what I'm doing:

  • an inner join between df1 and df2 on the c1 and d1 columns
  • selecting the df2 columns, but replacing the "hard-coded" ds2 in the last column with ds1
  • dropping duplicates

This basically just filters out everything from df2 whose c1 has no corresponding key in df1.

Next I take the difference:

val diff = join
.except(df1)
.select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")

This is a basic set operation that finds everything that is in join but not in df1. These are the items to deactivate, so I select all the columns but replace the third one with a hard-coded INACTIVE.

All that's left is to put them together:

df1.union(diff)

This just combines df1 with the table of deactivated values we computed earlier to produce the final result:

+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10|  ACTIVE|ds1|
| a1| 20|  ACTIVE|ds1|
| a2| 50|  ACTIVE|ds1|
| a3| 60|  ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
+---+---+--------+---+

Again, you don't need all these intermediate values. I was just being verbose to make the process easier to follow.
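
If you prefer, the whole thing collapses into a single expression. Here is a condensed sketch of the same steps (df_output is just an illustrative name), assuming the same df1/df2 definitions and that spark.implicits._ and functions.lit are in scope:

// Same logic, without the named intermediates: inner join on the key,
// relabel ds2 -> ds1, diff against df1, mark the leftovers INACTIVE,
// then union back onto df1.
val df_output = df1.union(
  df1.join(df2, df1("c1") === df2("d1"), "inner")
    .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
    .dropDuplicates
    .except(df1)
    .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
)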

Answer 1 (score: 0)

Here is a dirty solution -

from pyspark.sql import functions as F


# find the rows from df2 that have a matching key c1 in df1
df3 = df1.join(df2,df1.c1==df2.c1)\
.select(df2.c1,df2.c2,df2.c3,df2.c5.alias('c4'))\
.dropDuplicates()

df3.show()

+---+---+------+---+
| c1| c2|    c3| c4|
+---+---+------+---+
| a1| 10|ACTIVE|ds2|
| a1| 20|ACTIVE|ds2|
| a1| 30|ACTIVE|ds2|
| a1| 40|ACTIVE|ds2|
+---+---+------+---+

# Union df3 with df1; set c3 to INACTIVE where c4 is 'ds2', and force c4 back to 'ds1'

df1.union(df3).dropDuplicates(['c1','c2'])\
.select('c1','c2',\
        F.when(df1.c4=='ds2','INACTIVE').otherwise('ACTIVE').alias('c3'),
        F.lit('ds1').alias('c4')
       )\
.orderBy('c1','c2')\
.show()

+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10|  ACTIVE|ds1|
| a1| 20|  ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
| a2| 50|  ACTIVE|ds1|
| a3| 60|  ACTIVE|ds1|
+---+---+--------+---+

Answer 2 (score: 0)

I enjoyed the challenge; here is my solution.

val c1keys = df1.select("c1").distinct
val df2_in_df1 = df2.join(c1keys, Seq("c1"), "inner")
val df2inactive = df2_in_df1.join(df1, Seq("c1", "c2"), "leftanti").withColumn("c3", lit("INACTIVE"))
df1.union(df2inactive).show
+---+---+--------+---+
| c1| c2|      c3| c4|
+---+---+--------+---+
| a1| 10|  ACTIVE|ds1|
| a1| 20|  ACTIVE|ds1|
| a2| 50|  ACTIVE|ds1|
| a3| 60|  ACTIVE|ds1|
| a1| 30|INACTIVE|ds2|
| a1| 40|INACTIVE|ds2|
+---+---+--------+---+
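
Note that c4 still reads ds2 on the deactivated rows here, because df2's c5 values pass through the positional union unchanged, whereas the expected output in the question has ds1. If you want that as well, one extra withColumn fixes it. A small variant sketch (df2inactiveDs1 is just an illustrative name), assuming the same definitions as above:

import org.apache.spark.sql.functions.lit

val df2inactiveDs1 = df2_in_df1
  .join(df1, Seq("c1", "c2"), "leftanti")
  .withColumn("c3", lit("INACTIVE"))
  .withColumn("c5", lit("ds1")) // relabel the source so it lines up with df1's c4

df1.union(df2inactiveDs1).show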