Filter a DataFrame's rows by whether they exist in another DataFrame

Time: 2019-03-30 18:39:15

Tags: scala apache-spark

I have two datasets, details and referenceDetails.

details

code   date     location   temperature
1      1-1-19   blr        30
2      1-2-18   up         33
3      1-2-18   dlh        25

referenceDetails

code   date     location
1      1-1-19   blr
2      1-2-18   up

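If a record's code exists in the referenceDetails dataset, I want to filter that record out of the details dataset as a valid detail; otherwise it should go into the invalid details.

I tried an inner join and a left_anti join, but that means joining twice. Is there any way to avoid joining twice?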

val invalidRecords = detailsDS.join(referencedetailsDS, Seq("code"), "left_anti")

val validRecords = detailsDS.join(referencedetailsDS, Seq("code"), "inner")

Valid details

code   date     location   temperature
1      1-1-19   blr        30
2      1-2-18   up         33

Invalid details

code   date     location   temperature
3      1-2-18   dlh        25

1 answer:

Answer 0 (score: 1)

With a left join:

// data
val details = Seq(
  (1, "1-1-19", "blr", 30),
  (2, "1-2-18", "up", 33),
  (3, "1-2-18", "dlh", 25)
).toDF("code", "date", "location", "temperature")

val refrenceDetails = Seq(
  (1, "1-1-19", "blr"),
  (2, "1-2-18", "up")).
  toDF("code", "date", "location")

// action
val joined = details.alias("d").join(refrenceDetails.alias("r"), Seq("code"), "left")
val validDetails = joined.where($"r.code".isNotNull)
val invalidDetails = joined.where($"r.code".isNull)

// display
validDetails.show(false)
invalidDetails.show(false)

Output:

+----+------+--------+-----------+------+--------+
|code|date  |location|temperature|date  |location|
+----+------+--------+-----------+------+--------+
|1   |1-1-19|blr     |30         |1-1-19|blr     |
|2   |1-2-18|up      |33         |1-2-18|up      |
+----+------+--------+-----------+------+--------+

+----+------+--------+-----------+----+--------+
|code|date  |location|temperature|date|location|
+----+------+--------+-----------+----+--------+
|3   |1-2-18|dlh     |25         |null|null    |
+----+------+--------+-----------+----+--------+
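
If only the original details columns are wanted, as in the expected output in the question, one option is to join against just the distinct reference codes plus a marker column, and drop the marker afterwards. The following is a minimal sketch under that assumption. The names referenceCodes, inRef and flagged are made up for illustration, and the local SparkSession is only there so the snippet runs on its own (spark-shell already provides one). Because Spark evaluates lazily, each of the two filters may recompute the join when an action runs, so the sketch caches the joined result; whether caching pays off depends on the data size.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Local session so the sketch is self-contained (assumption: not needed in spark-shell)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("split-by-reference")
  .getOrCreate()
import spark.implicits._

val details = Seq(
  (1, "1-1-19", "blr", 30),
  (2, "1-2-18", "up", 33),
  (3, "1-2-18", "dlh", 25)
).toDF("code", "date", "location", "temperature")

// Keep only the distinct codes from the reference data and tag them with a
// hypothetical marker column "inRef"; distinct() avoids duplicating detail
// rows if a code appears more than once in the reference data.
val referenceCodes = Seq(
  (1, "1-1-19", "blr"),
  (2, "1-2-18", "up")
).toDF("code", "date", "location")
  .select($"code")
  .distinct()
  .withColumn("inRef", lit(true))

// One left join; unmatched detail rows get a null marker.
// cache() lets both filters below reuse the joined result.
val flagged = details.join(referenceCodes, Seq("code"), "left").cache()

// Split on the marker, then drop it so only the details columns remain
val validDetails = flagged.where($"inRef".isNotNull).drop("inRef")
val invalidDetails = flagged.where($"inRef".isNull).drop("inRef")

validDetails.show(false)
invalidDetails.show(false)
```

Both results keep the columns code, date, location and temperature, because the reference side contributes only the join key and the marker.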