Question

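Using Spark 1.4.0, Scala 2.10

I've been trying to figure out a way to forward fill null values with the last known observation, but I don't see an easy way. I would think this is a pretty common thing to do, but I can't find an example showing how to do it.

I see functions to forward fill NaN with a value, and lag / lead functions to fill or shift data by an offset, but nothing to pick up the last known value.

Looking online, I see lots of Q/A about the same thing in R, but not in Spark / Scala.

I was thinking about mapping over a date range, filtering the NaNs out of the results and picking the last element, but I guess I'm confused about the syntax.

Using DataFrames I tried something like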

import org.apache.spark.sql.expressions.Window

val sqlContext = new HiveContext(sc)

var spec = Window.orderBy("Date")
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("test.csv")

val df2 = df.withColumn("testForwardFill", (90 to 0).map(i=>lag(df.col("myValue"),i,0).over(spec)).filter(p=>p.getItem.isNotNull).last)

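but that doesn't get me anywhere.

The filter part doesn't work; the map function returns a sequence of spark.sql.Column, but the filter function expects a predicate that returns a Boolean, so I need to get a value out of the Column to test on, and there only seem to be Column methods that return a Column.

Is there any way to do this more "simply" on Spark?

Thanks for your input

Edit:

Simple example input: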

2015-06-01,33
2015-06-02,
2015-06-03,
2015-06-04,
2015-06-05,22
2015-06-06,
2015-06-07,
...

Expected output:

2015-06-01,33
2015-06-02,33
2015-06-03,33
2015-06-04,33
2015-06-05,22
2015-06-06,22
2015-06-07,22

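Note:

I have many columns, many of which have this missing data pattern, but not at the same date/time. If I need to, I will do the transform one column at a time.

Edit:

Following @zero323's answer I tried it this way: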

    import org.apache.spark.sql.Row
    import org.apache.spark.rdd.RDD

    val rows: RDD[Row] = df.orderBy($"Date").rdd


    def notMissing(row: Row): Boolean = { !row.isNullAt(1) }

    val toCarry: scala.collection.Map[Int,Option[org.apache.spark.sql.Row]] = rows.mapPartitionsWithIndex{
      case (i, iter) => Iterator((i, iter.filter(notMissing(_)).toSeq.lastOption)) }
      .collectAsMap

    val toCarryBd = sc.broadcast(toCarry)

    def fill(i: Int, iter: Iterator[Row]): Iterator[Row] = { if (iter.contains(null)) iter.map(row => Row(toCarryBd.value(i).get(1))) else iter }

    val imputed: RDD[Row] = rows.mapPartitionsWithIndex{ case (i, iter) => fill(i, iter)}

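The broadcast variable ends up as a list of values without nulls. That's progress, but I still can't get the mapping to work. I get nothing back, because the index i doesn't map to the original data; it maps to the subset without nulls.

What am I missing here?

Edit and solution (as inferred from @zero323's answer):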

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag}
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

var spec = Window.partitionBy("id").orderBy("Date")
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("test.csv")

val df2 = df.withColumn("test", coalesce((0 to 90).map(i=>lag(df.col("test"),i,0).over(spec)): _*))

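If you are using RDDs instead of DataFrames, see zero323's answer below for more options. The solution above may not be the most efficient, but it works for me. If you're looking to optimize, check out the RDD solution.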

Answer 1

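Initial answer (a single time series assumption):

First of all, try to avoid window functions if you cannot provide a PARTITION BY clause. Without one, all data is moved to a single partition, so most of the time it is simply not feasible.

What you can do is fill the gaps on an RDD using mapPartitionsWithIndex. Since you didn't provide example data or expected output, treat this as a sketch rather than a finished Scala program:

First, order the DataFrame by date and convert it to an RDD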

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

val rows: RDD[Row] = df.orderBy($"Date").rdd

Next, let's find the last non-null observation in each partition

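// Assumes the value to fill sits in column 1, as in the question's (Date, value) sample
def notMissing(row: Row): Boolean = !row.isNullAt(1)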

val toCarry: scala.collection.Map[Int,Option[org.apache.spark.sql.Row]] = rows
  .mapPartitionsWithIndex{ case (i, iter) => 
    Iterator((i, iter.filter(notMissing(_)).toSeq.lastOption)) }
  .collectAsMap

and convert this Map to a broadcast variable
```
val toCarryBd = sc.broadcast(toCarry)
```

Finally, map over the partitions once again, filling the gaps:

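def fill(i: Int, iter: Iterator[Row]): Iterator[Row] = {
  // Value carried in from the nearest earlier partition that has an observation
  // (walking backwards also covers empty / only-missing partitions in between)
  var last: Option[Row] = (i - 1 to 0 by -1).flatMap(j => toCarryBd.value(j)).headOption
  iter.map { row =>
    // Otherwise take the last not-null row seen so far in the current partition
    if (notMissing(row)) { last = Some(row); row }
    // Assumes a two-column (Date, value) row; rows before any observation stay null
    else last.map(l => Row(row(0), l(1))).getOrElse(row)
  }
}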

val imputed: RDD[Row] = rows
  .mapPartitionsWithIndex{ case (i, iter) => fill(i, iter) }

Finally, convert back to a DataFrame
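To make the two passes easier to follow, here is a minimal sketch on plain Scala collections, with no Spark involved. The names `PartitionFill`, `lastObserved`, `carryInto` and `fillAll` are made up for illustration, each inner `Vector` stands in for one partition of the sorted RDD, and `None` stands in for a null value.

```scala
object PartitionFill {
  // One "partition" of the sorted data; None marks a missing value.
  type Part = Vector[Option[Int]]

  // Pass 1: last observed value in each partition (None if it has none).
  def lastObserved(parts: Vector[Part]): Vector[Option[Int]] =
    parts.map(_.flatten.lastOption)

  // Value carried into partition i: the last observation of the nearest
  // earlier partition that has one. Empty or all-missing partitions are skipped.
  def carryInto(i: Int, last: Vector[Option[Int]]): Option[Int] =
    last.take(i).flatten.lastOption

  // Pass 2: fill every partition independently, seeded with its carried value.
  def fillAll(parts: Vector[Part]): Vector[Part] = {
    val last = lastObserved(parts)
    parts.zipWithIndex.map { case (part, i) =>
      // scanLeft keeps the running "last known value"; tail drops the seed.
      part.scanLeft(carryInto(i, last))((prev, cur) => cur.orElse(prev)).tail
    }
  }
}
```

Only the small per-partition summary from pass 1 has to be collected and shared, which is why the RDD version broadcasts a Map instead of moving rows around.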

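Edit (partitioned / time series per group data):

The devil is in the detail. If your data is partitioned after all, then the whole problem can be solved using groupBy. Let's assume you simply partition by column "k" of type T and Date is an integer timestamp: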

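def fill(iter: List[Row]): List[Row] = {
  // Go row by row; replace each null field with the last non-null value of that column
  iter.scanLeft(Option.empty[Row]) { (prev, row) =>
    Some(Row.fromSeq(row.toSeq.zipWithIndex.map { case (v, j) =>
      if (v == null) prev.map(_(j)).orNull else v }))
  }.flatten
}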

val groupedAndSorted = df.rdd
  .groupBy(_.getAs[T]("k"))
  .mapValues(_.toList.sortBy(_.getAs[Int]("Date")))

val rows: RDD[Row] = groupedAndSorted.mapValues(fill).values.flatMap(identity)

val dfFilled = sqlContext.createDataFrame(rows, df.schema)

This way you can fill all columns at the same time.
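The group, sort and fill steps can be tried out on a local collection first. This is only a sketch: `Obs` is a hypothetical record type standing in for a Row with a key, an integer timestamp and a single nullable value, and `GroupedFill` is not part of any library.

```scala
object GroupedFill {
  // Hypothetical record: series key, integer timestamp, optional value.
  case class Obs(k: String, date: Int, value: Option[Double])

  // Fill one series that is already sorted by date.
  def fill(series: List[Obs]): List[Obs] = {
    // Running "last known value" for every position; tail drops the seed.
    val filled = series.scanLeft(Option.empty[Double])((prev, o) => o.value.orElse(prev)).tail
    series.zip(filled).map { case (o, v) => o.copy(value = v) }
  }

  // Group by key, sort each group by date, fill, and flatten back to one list.
  def fillByGroup(data: List[Obs]): List[Obs] =
    data.groupBy(_.k).toList.sortBy(_._1).flatMap { case (_, rows) =>
      fill(rows.sortBy(_.date))
    }
}
```

Values never leak from one key into another, because each group is filled on its own.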

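Can this be done with DataFrames instead of converting back and forth to RDD?

It depends, although it is unlikely to be efficient. If the maximum gap is relatively small you can do something like this: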

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{WindowSpec, Window}
import org.apache.spark.sql.Column

val maxGap: Int = ???  // Maximum gap between observations
val columnsToFill: List[String] = ???  // List of columns to fill
val suffix: String = "_" // To disambiguate between original and imputed
// Window per series; "k" and "Date" are assumed column names, adjust to your data
val w: WindowSpec = Window.partitionBy("k").orderBy("Date")

// Take lag 1 to maxGap and coalesce
def makeCoalesce(w: WindowSpec)(maxGap: Int)(suffix: String)(c: String): Column = {
  // Generate lag values between 1 and maxGap
  val lags = (1 to maxGap).map(i => lag(col(c), i).over(w))
  // Add current, coalesce and set alias
  coalesce(col(c) +: lags: _*).alias(s"$c$suffix")
}


// For each column you want to fill nulls apply makeCoalesce
val lags: List[Column] = columnsToFill.map(makeCoalesce(w)(maxGap)(suffix))


// Finally select
val dfImputed = df.select($"*" :: lags: _*)

It can be easily adjusted to use a different maximum gap per column.
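What coalescing over lags computes, and why maxGap matters, can be seen in a plain Scala sketch. `LagCoalesce` is a made-up name, and `None` plays the role of null.

```scala
object LagCoalesce {
  // For row i, return the first non-missing value among
  // values(i), values(i - 1), ..., values(i - maxGap).
  def fill(values: Vector[Option[Int]], maxGap: Int): Vector[Option[Int]] =
    Vector.tabulate(values.length) { i =>
      (0 to maxGap)
        .map(i - _)              // row indices at lag 0, 1, ..., maxGap
        .filter(_ >= 0)          // lags reaching before the first row yield nothing
        .flatMap(j => values(j)) // drop missing values, keep lag order
        .headOption              // same idea as coalesce: first non-null wins
    }
}
```

With the question's sample (33, three missing values, 22, ...) a maxGap of 2 leaves the third missing value unfilled, while a maxGap of 3 fills it. The cost also grows with maxGap, since every row evaluates that many lag expressions.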

A simpler way to get a similar result in recent Spark versions is to use last with ignoreNulls:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy($"k").orderBy($"Date")
  .rowsBetween(Window.unboundedPreceding, -1)

df.withColumn("value", coalesce($"value", last($"value", true).over(w)))

While it is possible to drop the partitionBy clause and apply this method globally, it would be prohibitively expensive with large data sets.

Answer 2

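It is possible to do this using only Window functions (without the last function) and some clever partitioning. Personally, I really dislike having to use a combination of groupBy followed by a join.

So given: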

date,      currency, rate
20190101   JPY       NULL
20190102   JPY       2
20190103   JPY       NULL
20190104   JPY       NULL
20190102   JPY       3
20190103   JPY       4
20190104   JPY       NULL

We can use Window.unboundedPreceding and Window.unboundedFollowing to create keys for forward and backward filling.

With the following code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w1 = Window.partitionBy("currency").orderBy(asc("date"))
df
   .select("date", "currency", "rate")
   // Equivalent of na.fill(0, Seq("rate")) but can be more generic here
   // You may need an abs(col("rate")) if the value column can be negative, since negative values break the sums used to build the forward and backward keys
   .withColumn("rate_filled", when(col("rate").isNull, lit(0)).otherwise(col("rate")))
   .withColumn("rate_backsum",
     sum("rate_filled").over(w1.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
   .withColumn("rate_forwardsum",
     sum("rate_filled").over(w1.rowsBetween(Window.currentRow, Window.unboundedFollowing)))

gives:

date,      currency, rate,  rate_filled, rate_backsum, rate_forwardsum
20190101   JPY       NULL             0             0             9
20190102   JPY       2                2             2             9
20190103   JPY       NULL             0             2             7
20190104   JPY       NULL             0             2             7
20190102   JPY       3                3             5             7
20190103   JPY       4                4             9             4
20190104   JPY       NULL             0             9             0

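So we have built two keys (rate_backsum and rate_forwardsum) that can be used to ffill and bfill. Rows sharing a rate_backsum value are one observation followed by its nulls, so that key gives the forward fill; rows sharing a rate_forwardsum value are one observation preceded by its nulls, so that key gives the backward fill. With the two following Spark lines:

val wb = Window.partitionBy("currency", "rate_backsum")
val wf = Window.partitionBy("currency", "rate_forwardsum")

   ...
   // avg ignores nulls, so each partition's average is its single observed rate
   .withColumn("rate_forwardfilled", avg("rate").over(wb))
   .withColumn("rate_backfilled", avg("rate").over(wf))

Finally:

date,      currency, rate,   rate_backsum, rate_forwardsum, rate_forwardfilled, rate_backfilled
20190101   JPY       NULL               0               9                NULL              2.0
20190102   JPY       2                  2               9                 2.0              2.0
20190103   JPY       NULL               2               7                 2.0              3.0
20190104   JPY       NULL               2               7                 2.0              3.0
20190102   JPY       3                  5               7                 3.0              3.0
20190103   JPY       4                  9               4                 4.0              4.0
20190104   JPY       NULL               9               0                 4.0             NULL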
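To check which key produces which fill, the same trick can be reproduced on a local collection. This is a sketch only: `SumKeyFill` and its methods are made-up names, `None` stands for NULL, and it assumes every observed rate is strictly positive so that a running sum changes exactly when a new observation appears.

```scala
object SumKeyFill {
  // Give every row the single observed value found among rows with the same key.
  private def fillByKey(keys: Vector[Int], rates: Vector[Option[Int]]): Vector[Option[Int]] = {
    val valueOfKey: Map[Int, Option[Int]] =
      keys.zip(rates).groupBy(_._1).map { case (key, rows) =>
        key -> rows.flatMap(_._2).headOption
      }
    keys.map(valueOfKey)
  }

  // Key = sum from the first row up to the current row: an observation shares
  // its key with the missing rows that FOLLOW it, which yields a forward fill.
  def forwardFill(rates: Vector[Option[Int]]): Vector[Option[Int]] = {
    val backSum = rates.map(_.getOrElse(0)).scanLeft(0)(_ + _).tail
    fillByKey(backSum, rates)
  }

  // Key = sum from the current row to the last row: an observation shares
  // its key with the missing rows that PRECEDE it, which yields a backward fill.
  def backwardFill(rates: Vector[Option[Int]]): Vector[Option[Int]] = {
    val forwardSum = rates.map(_.getOrElse(0)).scanRight(0)(_ + _).init
    fillByKey(forwardSum, rates)
  }
}
```

Rows before the first observation stay empty in the forward fill, and rows after the last observation stay empty in the backward fill, because their key group contains no observed value.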

Spark / Scala: forward fill with last observation
