I have two DataFrames, df1 and df2. Both have a 'date' column, as shown below.
Structure of df1
+----------+
| date|
+----------+
|02-01-2015|
|02-02-2015|
|02-03-2015|
+----------+
Structure of df2
+---+-------+-----+----------+
| ID|feature|value| date|
+---+-------+-----+----------+
| 1|balance| 100|01-01-2015|
| 1|balance| 100|05-01-2015|
| 1|balance| 100|30-01-2015|
| 1|balance| 100|01-02-2015|
| 1|balance| 100|01-03-2015|
+---+-------+-----+----------+
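I need to take each value in the 'date' column of df1, compare it with the 'date' column of df2, and get all rows from df2 whose date is earlier than the df1 date.
For example, taking the first row of df1, 02-01-2015, and getting all rows of df2 dated before 02-01-2015 would produce the following output: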
+---+-------+-----+----------+
| ID|feature|value| date|
+---+-------+-----+----------+
| 1|balance| 100|01-01-2015|
+---+-------+-----+----------+
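What is the best way to achieve this in Spark with Scala? I have hundreds of millions of rows. I considered using window functions in Spark, but a window is limited to a single DataFrame.
Answer 0 (score: 1)
This gets all of the results in a new DataFrame: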
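import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell
// Parse the dd-MM-yyyy strings so that the dates compare chronologically.
val df1 = Seq(
  "02-01-2015",
  "02-02-2015",
  "02-03-2015"
).toDF("date")
  .withColumn("date", from_unixtime(unix_timestamp($"date", "dd-MM-yyyy")))
val df2 = Seq(
  (1, "balance", 100, "01-01-2015"),
  (1, "balance", 100, "05-01-2015"),
  (1, "balance", 100, "30-01-2015"),
  (1, "balance", 100, "01-02-2015"),
  (1, "balance", 100, "01-03-2015")
).toDF("ID", "feature", "value", "date")
  .withColumn("date", from_unixtime(unix_timestamp($"date", "dd-MM-yyyy")))
// Non-equi left join: pair each df1 date with every df2 row dated before it.
df1.join(
  df2, df2("date") < df1("date"), "left"
).show()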
+-------------------+---+-------+-----+-------------------+
| date| ID|feature|value| date|
+-------------------+---+-------+-----+-------------------+
|2015-01-02 00:00:00| 1|balance| 100|2015-01-01 00:00:00|
|2015-02-02 00:00:00| 1|balance| 100|2015-01-01 00:00:00|
|2015-02-02 00:00:00| 1|balance| 100|2015-01-05 00:00:00|
|2015-02-02 00:00:00| 1|balance| 100|2015-01-30 00:00:00|
|2015-02-02 00:00:00| 1|balance| 100|2015-02-01 00:00:00|
|2015-03-02 00:00:00| 1|balance| 100|2015-01-01 00:00:00|
|2015-03-02 00:00:00| 1|balance| 100|2015-01-05 00:00:00|
|2015-03-02 00:00:00| 1|balance| 100|2015-01-30 00:00:00|
|2015-03-02 00:00:00| 1|balance| 100|2015-02-01 00:00:00|
|2015-03-02 00:00:00| 1|balance| 100|2015-03-01 00:00:00|
+-------------------+---+-------+-----+-------------------+
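The answer parses the date strings before comparing them, and the reason is worth spelling out: dd-MM-yyyy strings do not sort chronologically, because a plain string comparison looks at the day digits first. The following is a minimal plain-Scala sketch of that pitfall, using java.time rather than Spark; the helper name parse is hypothetical and not part of the original answer.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Pattern matching the dd-MM-yyyy strings used in the question.
val fmt = DateTimeFormatter.ofPattern("dd-MM-yyyy")

// Hypothetical helper: parse a dd-MM-yyyy string into a LocalDate.
def parse(s: String): LocalDate = LocalDate.parse(s, fmt)

val a = "01-02-2015" // 1 February 2015
val b = "02-01-2015" // 2 January 2015

// String comparison sees "01" < "02" in the day field and stops there.
val stringSaysEarlier = a < b
// Comparing the parsed dates gives the chronological answer.
val dateSaysEarlier = parse(a).isBefore(parse(b))

println(s"string comparison: $stringSaysEarlier, date comparison: $dateSaysEarlier")
// prints "string comparison: true, date comparison: false"
```

The snippet can be pasted into the Scala REPL or spark-shell. In Spark the same effect is achieved by converting the column, as the answer does with unix_timestamp, before applying the < condition.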
Edit: to get the number of matching records from df2, do:
df1.join(
df2, df2("date") < df1("date"), "left"
)
.groupBy(df1("date"))
.count
.orderBy(df1("date"))
.show
+-------------------+-----+
| date|count|
+-------------------+-----+
|2015-01-02 00:00:00| 1|
|2015-02-02 00:00:00| 4|
|2015-03-02 00:00:00| 5|
+-------------------+-----+
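One caveat with counting after a left join: a df1 date that has no earlier row in df2 still produces one joined row (with nulls on the df2 side), so a plain count reports 1 rather than 0. The sample data never hits this case. A possible variation, sketched below, counts a df2 column instead, since count on a column skips nulls. The local SparkSession, the use of to_date (available in Spark 2.2 and later), the extra 01-01-2015 row, and the names dates, rows and matchCounts are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, to_date}

// Assumption: a local session for experimentation; in spark-shell, spark already exists.
val spark = SparkSession.builder().master("local[*]").appName("count-sketch").getOrCreate()
import spark.implicits._

// 01-01-2015 is added here (not in the question) as a date with no earlier rows.
val dates = Seq("01-01-2015", "02-01-2015", "02-02-2015", "02-03-2015")
  .toDF("date")
  .withColumn("date", to_date($"date", "dd-MM-yyyy"))

val rows = Seq(
  (1, "balance", 100, "01-01-2015"),
  (1, "balance", 100, "05-01-2015"),
  (1, "balance", 100, "30-01-2015"),
  (1, "balance", 100, "01-02-2015"),
  (1, "balance", 100, "01-03-2015")
).toDF("ID", "feature", "value", "date")
  .withColumn("date", to_date($"date", "dd-MM-yyyy"))

val matchCounts = dates.join(rows, rows("date") < dates("date"), "left")
  .groupBy(dates("date"))
  // count(column) ignores nulls, so an unmatched date yields 0 instead of 1.
  .agg(count(rows("ID")).as("count"))
  .orderBy(dates("date"))

matchCounts.show()
```

With this data the unmatched date 01-01-2015 should come out with a count of 0, while the other three dates keep the counts shown in the table above.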
Answer 1 (score: 0)
If you only want to compare one row of df1 against the date column of df2, you should first select the desired row from df1:
val oneRowDF1 = df1.where($"date" === "02-01-2015").select($"date".as("date2"))
Then you should join using the comparison logic you already have:
df2.join(oneRowDF1, unix_timestamp(df2("date"), "dd-MM-yyyy") < unix_timestamp(oneRowDF1("date2"), "dd-MM-yyyy"))
.drop("date2")
This should give you:
+---+-------+-----+----------+
|ID |feature|value|date |
+---+-------+-----+----------+
|1 |balance|100 |01-01-2015|
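+---+-------+-----+----------+
Update
Joins are expensive because they generally require shuffling data between executors on different nodes.
You can simply use a filter instead, as follows: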
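val oneRowDF1 = df1.where($"date" === "02-01-2015").select(unix_timestamp($"date", "dd-MM-yyyy").as("date2"))
// Collect the single cutoff value to the driver, then filter df2 without a join.
val cutoff = oneRowDF1.first().getLong(0)
df2.filter(unix_timestamp($"date", "dd-MM-yyyy") < cutoff)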
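The filter approach handles one cutoff date at a time. If every date in the small DataFrame is needed and that DataFrame comfortably fits in memory, one option worth trying is a broadcast hint, which asks Spark to ship the small side to every executor instead of shuffling the large side. The sketch below is an illustration under those assumptions, not a benchmarked recommendation; the local SparkSession, the use of to_date, and the names cutoffs, balances and earlierRows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{broadcast, to_date}

// Assumption: a local session for experimentation; in spark-shell, spark already exists.
val spark = SparkSession.builder().master("local[*]").appName("broadcast-sketch").getOrCreate()
import spark.implicits._

// Small side: the cutoff dates (df1 in the question).
val cutoffs = Seq("02-01-2015", "02-02-2015", "02-03-2015")
  .toDF("cutoff")
  .withColumn("cutoff", to_date($"cutoff", "dd-MM-yyyy"))

// Large side: the rows to be filtered (df2 in the question).
val balances = Seq(
  (1, "balance", 100, "01-01-2015"),
  (1, "balance", 100, "05-01-2015"),
  (1, "balance", 100, "30-01-2015"),
  (1, "balance", 100, "01-02-2015"),
  (1, "balance", 100, "01-03-2015")
).toDF("ID", "feature", "value", "date")
  .withColumn("date", to_date($"date", "dd-MM-yyyy"))

// broadcast() is only a hint; Spark decides the final physical plan.
val earlierRows = balances.join(broadcast(cutoffs), balances("date") < cutoffs("cutoff"))

earlierRows.show()
```

Calling earlierRows.explain() shows which join strategy was actually chosen. Note that the result can still be large, since each row of the big table is repeated once for every cutoff date that follows it.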
I hope this answer is helpful.