How to count rows in a Spark dataframe based on the values (primary keys) of another dataframe?

Time: 2017-09-04 07:20:24

Tags: scala apache-spark spark-dataframe

I have two dataframes, df1 and df2. Both have a column 'date', as shown below.

Structure of df1:

+----------+
|      date|
+----------+
|02-01-2015|
|02-02-2015|
|02-03-2015|
+----------+

Structure of df2:

+---+-------+-----+----------+
| ID|feature|value|      date|
+---+-------+-----+----------+
|  1|balance|  100|01-01-2015|
|  1|balance|  100|05-01-2015|
|  1|balance|  100|30-01-2015|
|  1|balance|  100|01-02-2015|
|  1|balance|  100|01-03-2015|
+---+-------+-----+----------+

I have to take each 'date' from df1, compare it against df2's 'date', and get all rows from df2 whose date is less than the date in df1.

Say we take the first row of df1, 02-01-2015, and fetch all rows from df2 with a date less than 02-01-2015. That would produce the following output:

+---+-------+-----+----------+
| ID|feature|value|      date|
+---+-------+-----+----------+
|  1|balance|  100|01-01-2015|
+---+-------+-----+----------+ 

What is the best way to achieve this in Spark/Scala? I have several hundred million rows. I wanted to use window functions in Spark, but a window is restricted to a single dataframe.

2 Answers:

Answer 0 (score: 1)

This gets all the results into a new dataframe:

import org.apache.spark.sql.functions._ // from_unixtime, unix_timestamp
import spark.implicits._                // spark-shell session; toDF and $-notation

// normalise the dd-MM-yyyy strings into timestamps
val df1 = Seq(
  "02-01-2015",
  "02-02-2015",
  "02-03-2015"
).toDF("date")
  .withColumn("date", from_unixtime(unix_timestamp($"date", "dd-MM-yyyy")))

val df2 = Seq(
  (1, "balance", 100, "01-01-2015"),
  (1, "balance", 100, "05-01-2015"),
  (1, "balance", 100, "30-01-2015"),
  (1, "balance", 100, "01-02-2015"),
  (1, "balance", 100, "01-03-2015")
).toDF("ID", "feature", "value", "date")
  .withColumn("date", from_unixtime(unix_timestamp($"date", "dd-MM-yyyy")))

// non-equi join: pair every df1 date with all strictly earlier df2 rows
df1.join(
  df2, df2("date") < df1("date"), "left"
).show()


+-------------------+---+-------+-----+-------------------+
|               date| ID|feature|value|               date|
+-------------------+---+-------+-----+-------------------+
|2015-01-02 00:00:00|  1|balance|  100|2015-01-01 00:00:00|
|2015-02-02 00:00:00|  1|balance|  100|2015-01-01 00:00:00|
|2015-02-02 00:00:00|  1|balance|  100|2015-01-05 00:00:00|
|2015-02-02 00:00:00|  1|balance|  100|2015-01-30 00:00:00|
|2015-02-02 00:00:00|  1|balance|  100|2015-02-01 00:00:00|
|2015-03-02 00:00:00|  1|balance|  100|2015-01-01 00:00:00|
|2015-03-02 00:00:00|  1|balance|  100|2015-01-05 00:00:00|
|2015-03-02 00:00:00|  1|balance|  100|2015-01-30 00:00:00|
|2015-03-02 00:00:00|  1|balance|  100|2015-02-01 00:00:00|
|2015-03-02 00:00:00|  1|balance|  100|2015-03-01 00:00:00|
+-------------------+---+-------+-----+-------------------+
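
A performance note that is not part of the original answer: since df1 here is tiny (just a list of cutoff dates), a broadcast hint can turn the non-equi join into a broadcast nested-loop join so that the large df2 side is never shuffled. A hedged sketch (inner-join variant, so df1 dates with no earlier df2 rows would drop out, unlike the left join above):

import org.apache.spark.sql.functions.broadcast

// broadcast the small df1 side; Spark plans a BroadcastNestedLoopJoin and
// leaves the large df2 unshuffled
df2.join(broadcast(df1), df2("date") < df1("date")).show()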

Edit: To get the count of matching records from df2, do:

df1.join(
  df2, df2("date") < df1("date"), "left"
)
  .groupBy(df1("date"))
  .count
  .orderBy(df1("date"))
  .show

+-------------------+-----+
|               date|count|
+-------------------+-----+
|2015-01-02 00:00:00|    1|
|2015-02-02 00:00:00|    4|
|2015-03-02 00:00:00|    5|
+-------------------+-----+
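
The asker mentioned that window functions seem restricted to a single dataframe. One workaround, a sketch of my own and not part of the original answer: tag the rows from both frames, union them on the parsed date, and take a running count of df2 rows with a single window. Ordering df1 rows (tag 0) before df2 rows (tag 1) on equal dates keeps the comparison strictly "less than":

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// tag = 1 for df2 rows, tag = 0 for df1 rows, then union by position
val tagged = df2.select($"date", lit(1).as("tag"))
  .union(df1.select($"date", lit(0).as("tag")))

// caveat: no partitionBy, so the sort runs in a single task -- fine as a
// sketch, but it needs a partitioning key at "hundreds of millions" scale
val w = Window.orderBy($"date", $"tag")

val countsPerDf1Date = tagged
  .withColumn("cnt", sum($"tag").over(w)) // running count of df2 rows so far
  .where($"tag" === 0)                    // keep only the df1 cutoff dates
  .select($"date", $"cnt".as("count"))

This replaces the near-cartesian join with a single sort, subject to the single-partition caveat noted in the comment.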

Answer 1 (score: 0)

If you only want to compare one row of df1 against df2's date column, then you should first select the intended row from df1:

// filter first, then rename: after the rename, column "date" no longer exists
val oneRowDF1 = df1.where($"date" === "02-01-2015").select($"date".as("date2"))

Then you should join with df2 using the logic you already have:

df2.join(oneRowDF1, unix_timestamp(df2("date"), "dd-MM-yyyy") < unix_timestamp(oneRowDF1("date2"), "dd-MM-yyyy"))
    .drop("date2")

This should give you:

+---+-------+-----+----------+
|ID |feature|value|date      |
+---+-------+-----+----------+
|1  |balance|100  |01-01-2015|
+---+-------+-----+----------+

Update:

A join is expensive, as it requires data to be shuffled between executors on different nodes.

You can simply use a filter, as below:

// as above: filter on "date" before it is renamed to "date2"
val oneRowDF1 = df1.where($"date" === "02-01-2015")
  .select(unix_timestamp($"date", "dd-MM-yyyy").as("date2"))

// take(1)(0)(0) pulls the single cutoff value (a unix timestamp) to the driver
df2.filter(unix_timestamp($"date", "dd-MM-yyyy") < oneRowDF1.take(1)(0)(0))
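
A small typed variant (my sketch, assuming spark.implicits._ is in scope): collecting the cutoff as a Long instead of reading an untyped value out of a Row makes the comparison explicit:

// oneRowDF1 has a single LongType column, so it maps onto Dataset[Long]
val cutoff: Long = oneRowDF1.as[Long].head()

df2.filter(unix_timestamp($"date", "dd-MM-yyyy") < cutoff)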

I hope this answer is helpful.