Fill null or empty values with the next Row value using Spark

Time: 2019-05-21 10:17:30

Tags: scala apache-spark apache-spark-sql

Is there a way to replace null values in a Spark DataFrame with the next non-null value from the rows below? An extra row_count column has been added for Window partitioning and ordering. More specifically, I would like to achieve the following result:

      +---------+----+      +---------+----+
      |row_count|  id|      |row_count|  id|
      +---------+----+      +---------+----+
      |        1|null|      |        1| 109|
      |        2| 109|      |        2| 109|
      |        3|null|      |        3| 108|
      |        4|null|      |        4| 108|
      |        5| 108| =>   |        5| 108|
      |        6|null|      |        6| 110|
      |        7| 110|      |        7| 110|
      |        8|null|      |        8|null|
      |        9|null|      |        9|null|
      |       10|null|      |       10|null|
      +---------+----+      +---------+----+

I tried the following code, but it did not give the correct result.

      val ss = dataframe.select($"*",
        sum(when(dataframe("id").isNull || dataframe("id") === "", 1).otherwise(0))
          .over(Window.orderBy($"row_count")) as "value")
      val window1 = Window.partitionBy($"value").orderBy("id").rowsBetween(0, Long.MaxValue)
      val selectList = ss.withColumn("id_fill_from_below", last("id").over(window1))
        .drop($"row_count").drop($"value")
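
For reference, the intended backfill semantics (each null takes the first non-null value below it; trailing nulls stay null) can be sketched in plain Scala on a collection. This is only an illustration of the target behavior, not Spark code, and `backfill` is a hypothetical helper name:

```scala
// Sketch of the desired backfill: each None takes the next Some(_) value
// that follows it in the sequence; trailing Nones remain None.
def backfill[A](xs: Seq[Option[A]]): Seq[Option[A]] =
  xs.zipWithIndex.map { case (x, i) =>
    // keep the existing value, otherwise scan forward for the next non-empty one
    x.orElse(xs.drop(i + 1).collectFirst { case Some(v) => v })
  }

// the id column from the question, in row_count order
val ids: Seq[Option[Int]] = Seq(None, Some(109), None, None, Some(108),
                                None, Some(110), None, None, None)
```

In Spark itself, this kind of backfill is commonly expressed with `first($"id", ignoreNulls = true)` over a window ordered by `row_count` with `rowsBetween(0, Window.unboundedFollowing)`.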

1 answer:

Answer 0: (score: 0)

Here is one approach:

  1. Filter the non-null rows (dfNonNulls)
  2. Filter the null rows (dfNulls)
  3. Use a join and a Window function to find the correct value for each null id
  4. Fill the null DataFrame (dfNullFills)
  5. Union dfNonNulls and dfNullFills
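
The steps above can be sketched on plain Scala collections (the variable names mirror the DataFrames below, but this is only an illustration of the split/join/pick-nearest logic, not Spark code):

```scala
// The question's data in row_count order (None stands for a null id).
val rows: Seq[(Int, Option[Int])] =
  Seq(1 -> None, 2 -> Some(109), 3 -> None, 4 -> None, 5 -> Some(108),
      6 -> None, 7 -> Some(110), 8 -> None, 9 -> None, 10 -> None)

val nonNulls = rows.collect { case (rc, Some(id)) => (rc, id) }  // step 1
val nulls    = rows.collect { case (rc, None)     => rc }        // step 2

// Steps 3-4: the "join + row_number === 1" picks, for each null row, the id
// of the nearest following non-null row (nonNulls is already sorted by
// row_count, so find(...) returns the smallest row_count greater than rc).
val nullFills = nulls.map { rc =>
  rc -> nonNulls.find(_._1 > rc).map(_._2)
}

// Step 5: union both halves and restore row_count order.
val result =
  (nullFills ++ nonNulls.map { case (rc, id) => rc -> Option(id) }).sortBy(_._1)
```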

data.csv

row_count,id
1,
2,109
3,
4,
5,108
6,
7,110
8,
9,
10,

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

var df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data.csv")

var dfNulls = df.filter(
  $"id".isNull
).withColumnRenamed(
  "row_count","row_count_nulls"
).withColumnRenamed(
  "id","id_nulls"
)

val dfNonNulls = df.filter(
  $"id".isNotNull
).withColumnRenamed(
  "row_count","row_count_values"
).withColumnRenamed(
  "id","id_values"
)

dfNulls = dfNulls.join(
  dfNonNulls, $"row_count_nulls" lt $"row_count_values","left"
).select(
  $"id_nulls",$"id_values",$"row_count_nulls",$"row_count_values"
)

val window = Window.partitionBy("row_count_nulls").orderBy("row_count_values")

val dfNullFills = dfNulls.withColumn(
  "rn", row_number.over(window)
).where($"rn" === 1).drop("rn").select(
  $"row_count_nulls".alias("row_count"),$"id_values".alias("id"))

dfNullFills.union(dfNonNulls).orderBy($"row_count").show()

This results in:

+---------+----+
|row_count|  id|
+---------+----+
|        1| 109|
|        2| 109|
|        3| 108|
|        4| 108|
|        5| 108|
|        6| 110|
|        7| 110|
|        8|null|
|        9|null|
|       10|null|
+---------+----+