Suppose I have a DataFrame with the following four columns:
Employee  Action    Updated on  Salaried on
1         emailed   2015-07-01  2015-07-12
1         worked    2015-07-03  null
1         played    2015-07-06  2015-07-28
1         finished  2015-07-07  null
2         food      2015-07-09  null
2         cool      2015-07-11  2015-07-10
The expected result is:
Employee  Action    Updated on  Salaried on
1         emailed   2015-07-01  2015-07-12
1         worked    2015-07-03  2015-07-28
1         played    2015-07-06  2015-07-28
1         finished  2015-07-07  2015-07-28
2         food      2015-07-09  2015-07-10
2         cool      2015-07-11  2015-07-10
What is happening here?
For each employee, with rows ordered by Updated on: whenever a Salaried on entry is null, it takes the Salaried on value from the nearest future row of the same employee; if no future row has a value, it takes the nearest past value from the same column.
For example, row 5 takes its value from row 6, row 4 takes its value from row 3, and row 2 takes its value from row 3. Note: the future takes precedence over the past.
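To make the rule concrete, here is a minimal sketch on plain Scala collections (no Spark involved); `Rec` and `fillSalaried` are hypothetical names introduced only for illustration:

```scala
// Illustrative sketch of the fill rule, assuming ISO date strings
// (which sort correctly as plain strings).
case class Rec(employee: Int, updatedOn: String, salariedOn: Option[String])

def fillSalaried(rows: Seq[Rec]): Seq[Rec] =
  rows.groupBy(_.employee).toSeq.flatMap { case (_, group) =>
    val sorted = group.sortBy(_.updatedOn)
    sorted.zipWithIndex.map { case (r, i) =>
      if (r.salariedOn.isDefined) r
      else {
        // Prefer the nearest non-null value in the future...
        val future = sorted.drop(i + 1).flatMap(_.salariedOn).headOption
        // ...and fall back to the nearest non-null value in the past.
        val past = sorted.take(i).reverse.flatMap(_.salariedOn).headOption
        r.copy(salariedOn = future.orElse(past))
      }
    }
  }
```

On the example data this fills `worked` and `finished` with 2015-07-28 and `food` with 2015-07-10, matching the expected output.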
My attempt: I have tried using map and reduce, but is there a nicer technique in Spark that solves this in a better way?
Answer 0 (score: 2)
If you assume an unbounded number of entries, arbitrary gap sizes, and that you are interested in the unbounded time window described in a comment, then all you can really do is hope that the Catalyst optimizer will be able to do something clever. First, let's reproduce the example data:
import org.apache.spark.sql.functions.{coalesce, not}
import sqlContext.implicits._  // needed for the $"..." column syntax below

case class Record(
  employee: Int, action: String,
  updated_on: java.sql.Date, salaried_on: java.sql.Date)

val rdd = sc.parallelize(List(
  Record(1, "emailed",  java.sql.Date.valueOf("2015-07-01"), java.sql.Date.valueOf("2015-07-12")),
  Record(1, "worked",   java.sql.Date.valueOf("2015-07-03"), null),
  Record(1, "played",   java.sql.Date.valueOf("2015-07-06"), java.sql.Date.valueOf("2015-07-28")),
  Record(1, "finished", java.sql.Date.valueOf("2015-07-07"), null),
  Record(2, "food",     java.sql.Date.valueOf("2015-07-09"), null),
  Record(2, "cool",     java.sql.Date.valueOf("2015-07-11"), java.sql.Date.valueOf("2015-07-10"))))

val df = sqlContext.createDataFrame(rdd)
The first thing we can do is split the data into the rows with null and non-null salaried_on:
val dfNotNull = df.where(not($"salaried_on".isNull))
val dfNull    = df.where($"salaried_on".isNull)

// Rename the non-null side so column names don't clash after the join
val dfNotNullRenamed = dfNotNull.
  withColumnRenamed("employee", "emp").
  withColumnRenamed("updated_on", "upd").
  withColumnRenamed("salaried_on", "sal").
  select("emp", "upd", "sal")
Now we can use a left outer join between the two and fill the gaps:
val joinedWithFuture = dfNull.join(
  dfNotNullRenamed,
  df("employee") <=> dfNotNullRenamed("emp") &&
    dfNotNullRenamed("upd") >= df("updated_on"),  // only match rows from the future
  "left_outer"
).withColumn("salaried_on", coalesce($"salaried_on", $"sal")).
  drop("emp").drop("sal")
Finally, we can use row_number to keep only the closest match for each row, and union the result with the non-null part:
joinedWithFuture.registerTempTable("joined_with_future")

// Keep, for each original row, only the future match closest in time
val query = """SELECT * FROM (SELECT *, row_number() OVER (
    PARTITION BY employee, action, updated_on
    ORDER BY ABS(DATEDIFF(upd, updated_on))
  ) rn FROM joined_with_future) tmp WHERE rn = 1"""
val dfNullImputed = sqlContext.
  sql(query).
  drop("rn").
  drop("upd").
  unionAll(dfNotNull).
  orderBy("employee", "updated_on")
If there are still gaps, repeat the whole process with dfNotNullRenamed("upd") >= df("updated_on") replaced by dfNotNullRenamed("upd") < df("updated_on"), so that the remaining nulls are filled from the nearest past value instead.
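As a side note, in later Spark versions the same fill rule can be expressed directly with window functions; this is a sketch under the assumption of Spark 2.1+, where `first`/`last` accept an `ignoreNulls` flag and `Window` exposes the frame-boundary constants used below (it is not part of the original answer):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, first, last}

// Look from the current row forward, and from the start up to the current row
val future = Window.partitionBy("employee").orderBy("updated_on")
  .rowsBetween(Window.currentRow, Window.unboundedFollowing)
val past = Window.partitionBy("employee").orderBy("updated_on")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val filled = df.withColumn("salaried_on",
  coalesce(
    first($"salaried_on", ignoreNulls = true).over(future),  // nearest future value
    last($"salaried_on", ignoreNulls = true).over(past)))    // else nearest past value
```

Because coalesce tries the forward-looking window first, the future still takes precedence, and the backward-looking window only fills rows that have no future value, with no need for a second pass.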