I have a DataFrame like this:
+---------+--------+-------+
|date |id |typ_mvt|
+---------+--------+-------+
|date_1 |5697 |C |
|date_2 |5697 |M |
|date_3 |NULL |M |
|date_4 |NULL |S |
+---------+--------+-------+
I want to fill in the NULL id values, like this:
+---------+--------+-------+
|date |id |typ_mvt|
+---------+--------+-------+
|date_1 |5697 |C |
|date_2 |5697 |M |
|date_3 |5697 |M |
|date_4 |5697 |S |
+---------+--------+-------+
Is there a way to do this?
Thanks in advance for your answers.
Answer 0 (score: 0)
Hello Doc, na.fill does the job:
// Assumes a spark-shell session, where sc (SparkContext) and
// spark (SparkSession) are already available.
import spark.implicits._

val rdd = sc.parallelize(Seq(
  (201901, Integer.valueOf(5697), "C"),
  (201902, Integer.valueOf(5697), "M"),
  (201903, null.asInstanceOf[Integer], "M"),
  (201904, null.asInstanceOf[Integer], "S")
))
val df = rdd.toDF("date", "id", "typ_mvt")

// Take any non-null id as the fill value, then replace the nulls with it.
val sampleId = df.filter($"id".isNotNull).select($"id").first.getInt(0)
val newDf = df.na.fill(sampleId, Seq("id"))
Otherwise, here is a very similar question with a much better solution: Fill in null with previously known good value with pyspark
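One caveat worth noting: na.fill with a single sample value only works here because every row shares the same id; with several distinct ids it would overwrite all nulls with one value. The linked answer instead carries the last known non-null value forward (in Spark, `last($"id", ignoreNulls = true)` over a window ordered by `date`). As a rough sketch of that forward-fill logic in plain Scala, without a Spark session:

```scala
// Forward-fill: replace each None with the most recent Some value seen so far.
// This mirrors per-row what Spark's last(col, ignoreNulls = true) does over
// an ordered window; forwardFill is a hypothetical helper, not a Spark API.
def forwardFill[A](xs: Seq[Option[A]]): Seq[Option[A]] =
  xs.scanLeft(Option.empty[A])((prev, cur) => cur.orElse(prev)).tail

val ids = Seq[Option[Int]](Some(5697), Some(5697), None, None)
println(forwardFill(ids))  // List(Some(5697), Some(5697), Some(5697), Some(5697))
```

Note that a leading None stays None, since there is no earlier value to carry forward.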