How to apply partitioning over multiple columns in a Spark Scala dataframe?

Asked: 2018-01-30 00:10:41

Tags: scala apache-spark

I have the following dataframe df in Spark Scala:

id   project  start_date    Change_date designation

1    P1       08/10/2018      01/09/2017   2
1    P1       08/10/2018      02/11/2018   3
1    P1       08/10/2018      01/08/2016   1

I then need to get the designation whose Change_date is closest to start_date while being earlier than it.

Expected output:

id   project  start_date    Change_date designation

1    P1       08/10/2018      01/09/2017   2

This is because the change date 01/09/2017 is the nearest date before start_date.

Can someone suggest how to achieve this?

This is not about picking the first row, but about picking the designation corresponding to the change date closest to the start date.

1 Answer:

Answer 0 (score: 1):

Parse the dates:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = ???
import spark.implicits._

val df = Seq(
  (1, "P1", "08/10/2018", "01/09/2017", 2), 
  (1, "P1", "08/10/2018", "02/11/2018", 3),
  (1, "P1", "08/10/2018", "01/08/2016", 1)
).toDF("id", "project_id", "start_date", "changed_date", "designation")

// parse the dd/MM/yyyy strings into proper DateType columns
val parsed = df
  .withColumn("start_date", to_date($"start_date", "dd/MM/yyyy"))
  .withColumn("changed_date", to_date($"changed_date", "dd/MM/yyyy"))

Compute the difference:

// keep only change dates strictly before start_date
val diff = parsed
  .withColumn("diff", datediff($"start_date", $"changed_date"))
  .where($"diff" > 0)

Then apply your solution of choice from How to select the first row of each group?, for example a window function. If you group by id:

import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy($"id").orderBy($"diff")

diff.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").show
// +---+----------+----------+------------+-----------+----+
// | id|project_id|start_date|changed_date|designation|diff|
// +---+----------+----------+------------+-----------+----+
// |  1|        P1|2018-10-08|  2017-09-01|          2| 402|
// +---+----------+----------+------------+-----------+----+
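As an alternative to the window function (a sketch, not part of the original answer), you can take the per-id minimum of a struct: Spark compares structs field by field, so putting diff first selects the row with the smallest positive diff in each group:

// min over a struct compares fields left to right, so diff drives the ordering
diff
  .groupBy($"id")
  .agg(min(struct($"diff", $"designation")).as("closest"))
  .select($"id", $"closest.designation")
  .show
// +---+-----------+
// | id|designation|
// +---+-----------+
// |  1|          2|
// +---+-----------+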
