Using the OR operator in a Spark DataFrame join

Date: 2018-08-18 04:58:45

Tags: scala apache-spark

DataFrame 1:

+---------+---------+
|login_Id1|login_Id2|
+---------+---------+
|  1234567|  1234568|
|  1234567|     null|
|     null|  1234568|
|  1234567|  1000000|
|  1000000|  1234568|
|  1000000|  1000000|
+---------+---------+

DataFrame 2:

+--------+---------+-----------+
|login_Id|user_name| user_Email|
+--------+---------+-----------+
| 1234567|TestUser1|user1_Email|
| 1234568|TestUser2|user2_Email|
| 1234569|TestUser3|user3_Email|
| 1234570|TestUser4|user4_Email|
+--------+---------+-----------+

Expected output:

+---------+---------+--------+---------+-----------+
|login_Id1|login_Id2|login_Id|user_name| user_Email|
+---------+---------+--------+---------+-----------+
|  1234567|  1234568| 1234567|TestUser1|user1_Email|
|  1234567|     null| 1234567|TestUser1|user1_Email|
|     null|  1234568| 1234568|TestUser2|user2_Email|
|  1234567|  1000000| 1234567|TestUser1|user1_Email|
|  1000000|  1234568| 1234568|TestUser2|user2_Email|
|  1000000|  1000000|    null|     null|       null|
+---------+---------+--------+---------+-----------+

My requirement is to join the two DataFrames so that the additional information for each login id is fetched from DataFrame 2. In most cases either login_Id1 or login_Id2 will have data; sometimes both columns may have data, and in that case I want to join on login_Id1. When neither column matches, I want the result to be null.

I have gone through this link:

Join in spark dataframe (scala) based on not null values

I tried:

DataFrame1.join(broadcast(DataFrame2), DataFrame1("login_Id1") === DataFrame2("login_Id") || DataFrame1("login_Id2") === DataFrame2("login_Id"))

The output I get is:

+---------+---------+--------+---------+-----------+
|login_Id1|login_Id2|login_Id|user_name| user_Email|
+---------+---------+--------+---------+-----------+
|  1234567|  1234568| 1234567|TestUser1|user1_Email|
|  1234567|  1234568| 1234568|TestUser2|user2_Email|
|  1234567|     null| 1234567|TestUser1|user1_Email|
|     null|  1234568| 1234568|TestUser2|user2_Email|
|  1234567|  1000000| 1234567|TestUser1|user1_Email|
|  1000000|  1234568| 1234568|TestUser2|user2_Email|
|  1000000|  1000000|    null|     null|       null|
+---------+---------+--------+---------+-----------+

When only one of the two columns has a value, I get the expected behavior. When both columns have values, the join is performed against both columns (Row1, Row3). Doesn't the || operator short-circuit?

Is there a way to get the expected DataFrame?

So far I have a UDF that checks whether login_Id1 has a value (returning login_Id1) or login_Id2 has a value (returning login_Id2); when both have values, it returns login_Id1. I add the result of the UDF as another column (FilteredId) to DataFrame1.
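A minimal sketch of that UDF, assuming the ids are strings (the names filteredIdUdf and df1WithFilteredId are illustrative):

import org.apache.spark.sql.functions.udf

// Prefer login_Id1 when both columns have a value; otherwise fall back to login_Id2
val filteredIdUdf = udf((id1: String, id2: String) =>
  if (id1 != null) id1 else id2
)

val df1WithFilteredId = DataFrame1.withColumn(
  "FilteredId",
  filteredIdUdf(DataFrame1("login_Id1"), DataFrame1("login_Id2"))
)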

DataFrame1 after adding the FilteredId column with the UDF:

+---------+---------+----------+
|login_Id1|login_Id2|FilteredId|
+---------+---------+----------+
|  1234567|  1234568|   1234567|
|  1234567|     null|   1234567|
|     null|  1234568|   1234568|
|  1234567|  1000000|   1234567|
|  1000000|  1234568|   1000000|
|  1000000|  1000000|   1000000|
+---------+---------+----------+

Then I join on FilteredId === login_Id and get the result:

DataFrame1.join(broadcast(DataFrame2), DataFrame1("FilteredId") === DataFrame2("login_Id"), "left_outer")

Is there a better way to do this without a UDF, using only a join that behaves like a short-circuiting OR?

Update: including the use case Leo pointed out, which my UDF approach misses. My exact requirement is: if either of the two input column values (login_Id1, login_Id2) matches DataFrame2's login_Id, the data for that login_Id should be fetched; if neither column matches, nulls should be appended (similar to a left outer join).

3 Answers:

Answer 0 (score: 0):

If you only want the second column when the first one is null, add that condition to your join clause:

@ df1.join(df2, df1("login_Id1") <=> df2("login_Id") || (df1("login_Id1").isNull && df1("login_Id2") <=> df2("login_Id"))).show()
+---------+---------+--------+---------+-----------+
|login_Id1|login_Id2|login_Id|user_name| user_Email|
+---------+---------+--------+---------+-----------+
|  1234567|  1234568| 1234567|TestUser1|user1_Email|
|  1234567|     null| 1234567|TestUser1|user1_Email|
|     null|  1234568| 1234568|TestUser2|user2_Email|
+---------+---------+--------+---------+-----------+

Note that the right-hand side of the || alone only finds this row:

@ df1.join(df2, df1("login_Id1").isNull && df1("login_Id2") <=> df2("login_Id")).show()
+---------+---------+--------+---------+-----------+
|login_Id1|login_Id2|login_Id|user_name| user_Email|
+---------+---------+--------+---------+-----------+
|     null|  1234568| 1234568|TestUser2|user2_Email|
+---------+---------+--------+---------+-----------+

Answer 1 (score: 0):

You can use the coalesce function to create a new value that is login_Id1 when it is not null, or login_Id2 when login_Id1 is null, and join the result against df2's login_Id.
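A minimal sketch of that coalesce-based join (the df1/df2 names and the left_outer mode, which keeps the unmatched rows, are assumptions):

import org.apache.spark.sql.functions.{broadcast, coalesce}

df1.join(
  broadcast(df2),
  coalesce(df1("login_Id1"), df1("login_Id2")) === df2("login_Id"),
  "left_outer"
)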

Answer 2 (score: 0):

It isn't clear to me whether your sample data already covers all the cases of login_Id pairs. If it does, a solution focused on null checking would suffice; if not, something slightly more involved (like your UDF approach) is needed.

One approach that does not rely on a UDF is to left_outer join df1 with df2 on login_Id1 and left_semi join df1 with df2 on login_Id2, appending to each a flag column for preference ordering, combine the two via union, join with df2 to pick up the non-key columns, and finally eliminate the duplicate rows based on flag.

Below is sample code with slightly more generalized sample data:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df1 = Seq(
  ("1234567", "1234568"),
  ("1234567", null),
  (null, "1234568"),
  ("1234569", "1000000"),
  ("1000000", "1234570"),
  ("1000000", "1000000")
).toDF("login_Id1", "login_Id2")

val df2 = Seq(
  ("1234567", "TestUser1", "user1_Email"),
  ("1234568", "TestUser2", "user2_Email"),
  ("1234569", "TestUser3", "user3_Email"),
  ("1234570", "TestUser4", "user4_Email")
).toDF("login_Id", "user_name", "user_Email")

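// left_outer join on login_Id1: flag 1 marks rows whose login_Id1 matched, flag 9 the rest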
val dfOuter = df1.join(df2, $"login_Id1" === df2("login_Id"), "left_outer").
  withColumn("flag", when($"login_Id".isNull, lit(9)).otherwise(lit(1))).
  select("login_Id1", "login_Id2", "flag")
// +---------+---------+----+
// |login_Id1|login_Id2|flag|
// +---------+---------+----+
// |  1234567|  1234568|   1|
// |  1234567|     null|   1|
// |     null|  1234568|   9|
// |  1234569|  1000000|   1|
// |  1000000|  1234570|   9|
// |  1000000|  1000000|   9|
// +---------+---------+----+

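// left_semi join on login_Id2: keeps only the df1 rows whose login_Id2 has a match (flag 2)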
val dfSemi = df1.join(df2, $"login_Id2" === df2("login_Id"), "left_semi").
  withColumn("flag", lit(2))
// +---------+---------+----+
// |login_Id1|login_Id2|flag|
// +---------+---------+----+
// |  1234567|  1234568|   2|
// |     null|  1234568|   2|
// |  1000000|  1234570|   2|
// +---------+---------+----+

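// Keep the lowest flag per (login_Id1, login_Id2) pair: a login_Id1 match (1)
// beats a login_Id2 match (2), which beats no match at all (9)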
val window = Window.partitionBy("login_Id1", "login_Id2").orderBy("flag")

(dfOuter union dfSemi).
  withColumn("row_num", row_number.over(window)).
  where($"row_num" === 1).
  withColumn("login_Id", when($"flag" === 1, $"login_Id1").
    otherwise(when($"flag" === 2, $"login_Id2"))
  ).
  join(df2, Seq("login_Id"), "left_outer").
  select("login_Id1", "login_Id2", "login_Id", "user_name", "user_Email")
// +---------+---------+--------+---------+-----------+
// |login_Id1|login_Id2|login_Id|user_name| user_Email|
// +---------+---------+--------+---------+-----------+
// |  1000000|  1000000|    null|     null|       null|
// |  1000000|  1234570| 1234570|TestUser4|user4_Email|
// |  1234567|  1234568| 1234567|TestUser1|user1_Email|
// |  1234569|  1000000| 1234569|TestUser3|user3_Email|
// |  1234567|     null| 1234567|TestUser1|user1_Email|
// |     null|  1234568| 1234568|TestUser2|user2_Email|
// +---------+---------+--------+---------+-----------+

Note that, as in your existing sample code, you could apply broadcast to df2 if it is much smaller than df1. If df2 is small enough to be collect-ed, the solution can be simplified to the following:

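// Collect df2's login ids to the driver (viable only when df2 is small)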
val loginIdList = df2.collect.map(r => r.getAs[String](0))

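// df1 rows in which neither column matches any login_Id in df2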
val df1Unmatched = df1.where(
  !$"login_Id1".isin(loginIdList: _*) && !$"login_Id2".isin(loginIdList: _*)
)

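// Inner-join the matchable rows, preferring login_Id1 over login_Id2, then
// append the unmatched rows with null-filled df2 columns via a left_outer join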
(df1 except df1Unmatched).
  join( broadcast(df2), $"login_Id1" === $"login_Id" ||
    ($"login_Id2" === $"login_Id" &&
      ($"login_Id1".isNull || !$"login_Id1".isin(loginIdList: _*))
    )
  ).
  union(
    df1Unmatched.join(df2, $"login_Id2" === $"login_Id", "left_outer")
  )