How can I recover column values with a specific UDF?

Date: 2019-08-14 10:30:11

Tags: scala dataframe apache-spark

I have a DataFrame like the following:

+---------+--------+-------+
|date     |id      |typ_mvt|
+---------+--------+-------+
|date_1   |5697    |C      |
|date_2   |5697    |M      |
|date_3   |NULL    |M      |
|date_4   |NULL    |S      |
+---------+--------+-------+

I would like to recover the NULL id values, like this:

+---------+--------+-------+
|date     |id      |typ_mvt|
+---------+--------+-------+
|date_1   |5697    |C      |
|date_2   |5697    |M      |
|date_3   |5697    |M      |
|date_4   |5697    |S      |
+---------+--------+-------+

Is there a way to do this?

Thank you for your answers.

1 Answer:

Answer 0 (score: 0)

Hello Doc, na.fill does the job here:

import spark.implicits._  // required for toDF and the $ column syntax

// Sample data; java.lang.Integer is used so that id can hold null
val rdd = sc.parallelize(Seq(
  (201901, Integer.valueOf(5697), "C"),
  (201902, Integer.valueOf(5697), "M"),
  (201903, null.asInstanceOf[Integer], "M"),
  (201904, null.asInstanceOf[Integer], "S")
))

val df = rdd.toDF("date", "id", "typ_mvt")

// Take the first non-null id and fill every null id with it
val sampleId = df.filter($"id".isNotNull).select($"id").first.getInt(0)
val newDf = df.na.fill(sampleId, Seq("id"))
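
Note that na.fill replaces every null id with the same constant, so this works here only because the sample frame contains a single id. With several distinct ids, the nulls would need to be filled from neighbouring rows instead, as in the window-based approach linked below.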

Otherwise, here is a very similar question with a much better solution: Fill in null with previously known good value with pyspark
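
For reference, a minimal Scala sketch of that forward-fill technique, assuming rows can be globally ordered by date (the names w and filledDf are illustrative; an un-partitioned window pulls all rows into a single partition, so this suits small data only):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// Carry the most recent non-null id forward in date order
val w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val filledDf = df.withColumn("id", last($"id", ignoreNulls = true).over(w))

Unlike na.fill, this handles frames with several distinct ids, since each null row takes the last non-null value seen before it rather than one global constant.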