I have two Hive tables / Spark DataFrames, A and B:
A
--------+----------+------+
product | date | id |
--------+----------+------+
A | 20200201 | X |
--------+----------+------+
B | 20200301 | Y |
--------+----------+------+
B
--------+-------+----------+
product | value | date |
--------+-------+----------+
A | 10 | 20191230 |
--------+-------+----------+
A | 5 | 20200310 |
--------+-------+----------+
B | 20 | 20200220 |
--------+-------+----------+
B | 10 | 20200130 |
--------+-------+----------+
I want a result like:
--------+----+-------+
product | id | value |
--------+----+-------+
A | X | 10 |
--------+----+-------+
B | Y | 20 |
--------+----+-------+
For a given product, if the date from table/DF A has no exact match in B, the B row with the closest earlier date should supply the value column for the result.
Can someone help me?
Answer 0 (score: 0)
I used the following Spark SQL approach to solve your problem:
import spark.implicits._
val dataA = Seq( ("A", 20200201,"X"), ("B",20200301, "Y"))
val dataB = Seq( ("A", 10, 20191230), ("A",5, 20200310), ("B", 20, 20200220), ("B", 10, 20200130))
val dfA = dataA.toDF("product", "date", "id")
val dfB = dataB.toDF("product", "value", "date")
dfA.createOrReplaceTempView("ta")
dfB.createOrReplaceTempView("tb")
spark.sql(
"""
|WITH filt AS (
|  SELECT a.product, a.id, b.value,
|         RANK() OVER (PARTITION BY a.product ORDER BY b.date DESC) AS rnk
|  FROM ta AS a
|  -- join on product so each A row only sees B rows for the same product,
|  -- then keep only B dates strictly before A's date
|  JOIN tb AS b ON a.product = b.product
|  WHERE a.date > b.date)
|SELECT product, id, value
|FROM filt
|WHERE rnk = 1
|""".stripMargin).show(truncate = false)
which gives the expected output:
+-------+---+-----+
|product|id |value|
+-------+---+-----+
|B |Y |20 |
|A |X |10 |
+-------+---+-----+
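The matching rule itself (for each A row, take the value from the B row of the same product with the closest earlier date) can also be sketched in plain Scala on the sample data, independent of Spark. The case-class names below are illustrative, not from the original post:

```scala
case class RowA(product: String, date: Int, id: String)
case class RowB(product: String, value: Int, date: Int)

val ta = Seq(RowA("A", 20200201, "X"), RowA("B", 20200301, "Y"))
val tb = Seq(RowB("A", 10, 20191230), RowB("A", 5, 20200310),
             RowB("B", 20, 20200220), RowB("B", 10, 20200130))

// For each A row, pick the B row of the same product with the largest
// date strictly before A's date, and take its value.
val result = ta.flatMap { a =>
  tb.filter(b => b.product == a.product && b.date < a.date)
    .sortBy(-_.date)
    .headOption
    .map(b => (a.product, a.id, b.value))
}
// result: Seq(("A","X",10), ("B","Y",20))
```

This is the same "as-of join" the SQL expresses with the rank-1 filter over a descending date ordering.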