Question

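I understand how an imputer is supposed to work, but I can't fully follow Spark's implementation of Imputer. I would like a beginner-level explanation of the following code: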

val results = $(strategy) match {
  case Imputer.mean =>
    // Function avg will ignore null automatically.
    // For a column only containing null, avg will return null.
    val row = dataset.select(cols.map(avg): _*).head()
    Array.range(0, $(inputCols).length).map { i =>
      if (row.isNullAt(i)) {
        Double.NaN
      } else {
        row.getDouble(i)
      }
    }

  case Imputer.median =>
    // Function approxQuantile will ignore null automatically.
    // For a column only containing null, approxQuantile will return an empty array.
    dataset.select(cols: _*).stat.approxQuantile($(inputCols), Array(0.5), 0.001)
      .map { array =>
        if (array.isEmpty) {
          Double.NaN
        } else {
          array.head
        }
      }
}

What I understand:

The logic of the imputer's two strategies, in pseudocode.
Basic Scala and Spark DataFrames.

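What I don't understand in Imputer.mean:

Why do we have val row here? Why do we have .head()?
How are the missing values imputed in Imputer.mean? I can see how the average of each column is computed, but I don't see where the values get imputed. Is it row.getDouble(i)?
What is this Array? Where is it declared? Does it have any relationship with val row?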

What I don't understand in Imputer.median:

Aren't we already computing the median with dataset.select(cols: _*).stat.approxQuantile($(inputCols), Array(0.5), 0.001)? Why do we have array here? Why do we return array.head?

Answer 1

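Why do we have val row here?

Because it is a descriptive name for the variable: the type of the value is Row.

Why do we have .head()?

Because the query alone returns a DataFrame, and we want the result of the query. head() runs the query and returns its first (and here only) Row.

How are the missing values imputed in Imputer.mean?

They are not. This code only computes the surrogate values. The imputation logic is implemented by ImputerModel.transform: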

override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  val surrogates = surrogateDF.select($(inputCols).map(col): _*).head().toSeq


  val newCols = $(inputCols).zip($(outputCols)).zip(surrogates).map {
    case ((inputCol, outputCol), surrogate) =>
      val inputType = dataset.schema(inputCol).dataType
      val ic = col(inputCol)
      when(ic.isNull, surrogate)
        .when(ic === $(missingValue), surrogate)
        .otherwise(ic)
        .cast(inputType)
  }
  dataset.withColumns($(outputCols), newCols).toDF()
}
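To make the when / otherwise chain above concrete, here is a minimal plain-Scala sketch of the same per-column replacement rule. It does not use Spark: Option[Double] stands in for a nullable column value, and the names ImputeTransformSketch and imputeColumn are hypothetical, not part of the Spark API. One assumption is handled explicitly: Spark SQL treats NaN as equal to NaN, while plain Scala does not.

```scala
object ImputeTransformSketch {
  // Hypothetical model of one column: None plays the role of a null cell.
  // Mirrors: when(ic.isNull, surrogate).when(ic === missingValue, surrogate).otherwise(ic)
  def imputeColumn(
      column: Seq[Option[Double]],
      surrogate: Double,
      missingValue: Double): Seq[Double] = {
    // In plain Scala NaN == NaN is false, so a NaN placeholder is checked with isNaN.
    def isMissing(v: Double): Boolean =
      if (missingValue.isNaN) v.isNaN else v == missingValue

    column.map {
      case None                    => surrogate // null cell
      case Some(v) if isMissing(v) => surrogate // cell equal to the missing-value placeholder
      case Some(v)                 => v         // everything else is kept as is
    }
  }
}
```

For example, with a surrogate of 2.0 and a placeholder of -1.0, the column (1.0, null, -1.0) becomes (1.0, 2.0, 2.0).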

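What is this Array?

It is the collection of results. Imputer is designed to impute multiple columns at once, so you need a collection of results, and Row, although collection-like, is untyped. Array.range(0, $(inputCols).length) produces the column indices, and map turns each index into that column's mean (or Double.NaN when the mean is null), giving an Array[Double] with one surrogate per input column.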
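The relationship between row and the resulting Array can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: Option[Double] stands in for a nullable cell, a Seq[Option[Double]] stands in for the aggregated Row, and the names MeanSurrogateSketch, avgIgnoringNulls and meanSurrogates are hypothetical.

```scala
object MeanSurrogateSketch {
  // Hypothetical stand-in for a DataFrame column: None plays the role of null.
  type Column = Seq[Option[Double]]

  // Mimics avg: nulls are skipped, and an all-null column yields None (null).
  def avgIgnoringNulls(column: Column): Option[Double] = {
    val present = column.flatten
    if (present.isEmpty) None else Some(present.sum / present.size)
  }

  // Mimics the Imputer.mean branch: one aggregated "row", then one Double per column.
  def meanSurrogates(columns: Seq[Column]): Array[Double] = {
    // Plays the role of dataset.select(cols.map(avg): _*).head()
    val row: Seq[Option[Double]] = columns.map(avgIgnoringNulls)
    Array.range(0, columns.length).map { i =>
      // row.isNullAt(i) => Double.NaN, otherwise row.getDouble(i)
      row(i).getOrElse(Double.NaN)
    }
  }
}
```

Calling meanSurrogates on two columns, (1.0, null, 3.0) and (null, null), gives an array whose first element is 2.0 and whose second element is NaN.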

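Why do we have array here? Why do we return array.head?

This is explained in the comment you copied into the question:

For a column only containing null, approxQuantile will return an empty array.

approxQuantile returns one array per input column, with one element per requested probability. Only 0.5 is requested here, so a non-empty array holds a single value, the approximate median, and array.head extracts it.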
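The shape of that result, and why the empty-array check is needed, can be illustrated with a small plain-Scala sketch. It does not call Spark: quantileHalf is a hypothetical stand-in for approxQuantile with the single probability 0.5, and it computes an exact lower median rather than an approximation.

```scala
object MedianSurrogateSketch {
  // Hypothetical stand-in for approxQuantile(..., Array(0.5), ...):
  // one inner array per column, empty when the column has no non-null values.
  // Uses an exact lower median; the real method is approximate.
  def quantileHalf(columns: Seq[Seq[Option[Double]]]): Array[Array[Double]] =
    columns.map { column =>
      val sorted = column.flatten.sorted
      if (sorted.isEmpty) Array.empty[Double]
      else Array(sorted((sorted.size - 1) / 2))
    }.toArray

  // Mimics the Imputer.median branch: empty array => NaN, otherwise the single quantile.
  def medianSurrogates(columns: Seq[Seq[Option[Double]]]): Array[Double] =
    quantileHalf(columns).map { array =>
      if (array.isEmpty) Double.NaN else array.head
    }
}
```

For the columns (3.0, null, 1.0, 2.0) and (null), medianSurrogates returns an array whose first element is 2.0 and whose second element is NaN.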

Understanding Spark Imputer