I have two datasets, Details and referenceDetails.
Details
1 1-1-19 blr 30
2 1-2-18 up 33
3 1-2-18 dlh 25
referenceDetails
1 1-1-19 blr
2 1-2-18 up
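If a record's code exists in the referenceDetails dataset, I want to put that record from the Details dataset into Valid Details; otherwise it should go into Invalid Details.
I tried an inner join and a left_anti join, but that means I have to join twice. Is there a way to avoid joining twice?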
val invalidRecords = detailsDS.join(referencedetailsDS, Seq("code"), "left_anti")
val validRecords = detailsDS.join(referencedetailsDS, Seq("code"), "inner")
Valid Details
code date location temperature
1 1-1-19 blr 30
2 1-2-18 up 33
Invalid Details
code date location temperature
3 1-2-18 dlh 25
Answer 0 (score: 1)
With a left join:
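// assumes a SparkSession named `spark` is in scope (as in spark-shell);
// the import provides toDF and the $"..." column syntax
import spark.implicits._
// data
val details = Seq(
  (1, "1-1-19", "blr", 30),
  (2, "1-2-18", "up", 33),
  (3, "1-2-18", "dlh", 25)
).toDF("code", "date", "location", "temperature")
val refrenceDetails = Seq(
  (1, "1-1-19", "blr"),
  (2, "1-2-18", "up")
).toDF("code", "date", "location")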
// action
val joined = details.alias("d").join(refrenceDetails.alias("r"), Seq("code"), "left")
val validDetails = joined.where($"r.code".isNotNull)
val invalidDetails = joined.where($"r.code".isNull)
// display
validDetails.show(false)
invalidDetails.show(false)
Output:
+----+------+--------+-----------+------+--------+
|code|date  |location|temperature|date  |location|
+----+------+--------+-----------+------+--------+
|1   |1-1-19|blr     |30         |1-1-19|blr     |
|2   |1-2-18|up      |33         |1-2-18|up      |
+----+------+--------+-----------+------+--------+

+----+------+--------+-----------+----+--------+
|code|date  |location|temperature|date|location|
+----+------+--------+-----------+----+--------+
|3   |1-2-18|dlh     |25         |null|null    |
+----+------+--------+-----------+----+--------+
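
The left join above carries the reference's date and location columns into both results, while the expected output in the question has only the four Details columns. One possible variation, offered as a sketch rather than the definitive way to do it: join against just the distinct codes from the reference plus a marker column, split on the marker, then drop it. The marker name `inRef` and the local SparkSession setup are assumptions made for illustration. Because Spark evaluates lazily, each of the two filtered results would otherwise re-run the join when an action is called on it, so the sketch caches the joined DataFrame.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Local session for illustration only; in spark-shell a session already exists
// and getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("split-valid-invalid")
  .getOrCreate()
import spark.implicits._

// Sample data from the question
val details = Seq(
  (1, "1-1-19", "blr", 30),
  (2, "1-2-18", "up", 33),
  (3, "1-2-18", "dlh", 25)
).toDF("code", "date", "location", "temperature")

val referenceDetails = Seq(
  (1, "1-1-19", "blr"),
  (2, "1-2-18", "up")
).toDF("code", "date", "location")

// Keep only the distinct join key and add a marker column.
// "inRef" is a hypothetical name chosen for this sketch.
val refCodes = referenceDetails
  .select("code")
  .distinct()
  .withColumn("inRef", lit(true))

// One left join; cache() lets both filters below reuse the joined rows
// instead of recomputing the join for each action.
val flagged = details.join(refCodes, Seq("code"), "left").cache()

// Rows with a match have inRef = true; unmatched rows have inRef = null.
val validDetails = flagged.where(col("inRef").isNotNull).drop("inRef")
val invalidDetails = flagged.where(col("inRef").isNull).drop("inRef")

validDetails.show(false)
invalidDetails.show(false)
```

With the sample data, `validDetails` holds the rows with codes 1 and 2 and `invalidDetails` holds the row with code 3, each with only the code, date, location and temperature columns. Taking `distinct()` codes also keeps a Details row from being duplicated if the reference happens to contain the same code more than once.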