Update a DataFrame with the next non-null value, or the previous value, in a column, per key.

Asked: 2015-07-29 17:10:06

Tags: scala apache-spark dataframe

Suppose I have a DataFrame with the following four columns:

Employee    Action      Updated on    Salaried on
1           emailed     2015-07-01    2015-07-12
1           worked      2015-07-03    null
1           played      2015-07-06    2015-07-28
1           finished    2015-07-07    null
2           food        2015-07-09    null
2           cool        2015-07-11    2015-07-10

The expected result is:

Employee    Action      Updated on    Salaried on
1           emailed     2015-07-01    2015-07-12
1           worked      2015-07-03    2015-07-28
1           played      2015-07-06    2015-07-28
1           finished    2015-07-07    2015-07-28
2           food        2015-07-09    2015-07-10
2           cool        2015-07-11    2015-07-10

What is going on here?

For each employee, ordered by Updated on: if an entry in Salaried on is null, it takes the Salaried on value from the nearest future row of the same employee; if no such row exists, it takes the value from the nearest past row in the same column.

For example, row 5 takes its value from row 6, row 4 takes its value from row 3, and row 2 takes its value from row 3. Note: the future takes precedence.
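
To pin the rule down, here is a minimal Scala sketch of the same fill logic on a plain in-memory collection (not Spark); the Rec case class and fillSalaried helper are illustrative names, not from the question:

case class Rec(employee: Int, action: String, updatedOn: String, salariedOn: Option[String])

// For each employee, sorted by update date: a missing salariedOn takes the nearest
// later non-null value, falling back to the nearest earlier one (the future wins).
def fillSalaried(rows: Seq[Rec]): Seq[Rec] =
  rows.groupBy(_.employee).values.flatMap { group =>
    val sorted = group.sortBy(_.updatedOn) // yyyy-MM-dd dates sort lexicographically
    sorted.zipWithIndex.map { case (r, i) =>
      if (r.salariedOn.isDefined) r
      else {
        val future = sorted.drop(i + 1).flatMap(_.salariedOn).headOption
        val past   = sorted.take(i).reverse.flatMap(_.salariedOn).headOption
        r.copy(salariedOn = future.orElse(past))
      }
    }
  }.toSeq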

My attempt: I have tried map & reduce, but is there a nicer technique that plays to Spark's strengths and solves this in a better way?

1 Answer:

Answer 0 (score: 2):

If you assume an unbounded number of entries, arbitrary gap sizes, and that you are interested in values over an unbounded time window, as described in a comment, then all you can really do is hope that the Catalyst optimizer manages to do something clever. First, let's reproduce the example data:

import org.apache.spark.sql.functions.{coalesce, not}
// Outside spark-shell you also need: import sqlContext.implicits._ (for the $"..." syntax)

case class Record(employee: Int, action: String, updated_on: java.sql.Date, salaried_on: java.sql.Date)

val rdd = sc.parallelize(List(
    Record(1, "emailed" , java.sql.Date.valueOf("2015-07-01"), java.sql.Date.valueOf("2015-07-12")),
    Record(1, "worked"  , java.sql.Date.valueOf("2015-07-03"), null),
    Record(1, "played"  , java.sql.Date.valueOf("2015-07-06"), java.sql.Date.valueOf("2015-07-28")),
    Record(1, "finished", java.sql.Date.valueOf("2015-07-07"), null),
    Record(2, "food"    , java.sql.Date.valueOf("2015-07-09"), null),
    Record(2, "cool"    , java.sql.Date.valueOf("2015-07-11"), java.sql.Date.valueOf("2015-07-10"))))

val df = sqlContext.createDataFrame(rdd)
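
A quick sanity check (in spark-shell, where sc and sqlContext are predefined):

df.printSchema() // employee: int, action: string, updated_on: date, salaried_on: date
df.show()        // should reproduce the six input rows from the question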

The first thing we can do is split the data into null and non-null parts:

// Split into rows that already have a salaried_on value and rows that need imputing.
val dfNotNull = df.where(not($"salaried_on".isNull))
val dfNull = df.where($"salaried_on".isNull)

// Rename the non-null side so column references stay unambiguous in the self-join below.
val dfNotNullRenamed = dfNotNull.
    withColumnRenamed("employee", "emp").
    withColumnRenamed("updated_on", "upd").
    withColumnRenamed("salaried_on", "sal").
    select("emp", "upd", "sal")

Now we can left-outer-join the two and fill the gaps:

// For each null row, attach every candidate of the same employee whose salary
// date is on or after this row's update date; keep the original value if present.
val joinedWithFuture = dfNull.join(
  dfNotNullRenamed, df("employee") <=> dfNotNullRenamed("emp") && 
  dfNotNullRenamed("sal") >= df("updated_on"),
  "left_outer"
).withColumn("salaried_on", coalesce($"salaried_on", $"sal")).
  drop("emp").drop("sal")

Finally, we can filter with row_number, keeping the candidate closest in time, and merge back with the non-null part:

// Spark 1.x API; in Spark 2.x and later this would be createOrReplaceTempView.
joinedWithFuture.registerTempTable("joined_with_future")

val query = """SELECT * FROM (SELECT *, row_number() OVER (
  PARTITION BY employee, action, updated_on
  ORDER BY ABS(CAST(timestamp(upd) as INT) - CAST(timestamp(updated_on) as INT))
) rn FROM joined_with_future) tmp WHERE rn = 1"""
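
As a side note, on later Spark versions (1.6+) the same deduplication can be written with the DataFrame window API instead of raw SQL. This is an assumed equivalent sketch, not part of the original answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{abs, row_number, unix_timestamp}

// Per (employee, action, updated_on), keep the candidate whose upd is closest to updated_on.
val w = Window.
  partitionBy("employee", "action", "updated_on").
  orderBy(abs(unix_timestamp($"upd") - unix_timestamp($"updated_on")))

val deduped = joinedWithFuture.
  withColumn("rn", row_number().over(w)).
  where($"rn" === 1).
  drop("rn")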

val dfNullImputed = sqlContext.
  sql(query).
  drop("rn").           // remove the helper row number
  drop("upd").          // remove the joined helper date column
  unionAll(dfNotNull).  // merge back the rows that never needed imputing
  orderBy("employee", "updated_on")

If gaps remain, repeat the whole process with dfNotNullRenamed("sal") >= df("updated_on") replaced by dfNotNullRenamed("sal") < df("updated_on").
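
A hedged sketch of that second pass, reusing the same pattern (the stillNull and filledSoFar names are assumptions):

// Rows still null after the first pass look to the past instead of the future.
val stillNull   = dfNullImputed.where($"salaried_on".isNull)
val filledSoFar = dfNullImputed.where(not($"salaried_on".isNull))

val joinedWithPast = stillNull.join(
  dfNotNullRenamed,
  stillNull("employee") <=> dfNotNullRenamed("emp") &&
    dfNotNullRenamed("sal") < stillNull("updated_on"),
  "left_outer"
).withColumn("salaried_on", coalesce($"salaried_on", $"sal")).
  drop("emp").drop("sal")
// ...then apply the same row_number deduplication and unionAll with filledSoFar.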