Spark: fill null values in a DataFrame with a Vector

Time: 2017-08-06 11:32:19

Tags: scala apache-spark dataframe vector null

I have a DataFrame that contains feature vectors created by VectorAssembler. It also contains null values. I now want to replace the nulls with a vector:

import org.apache.spark.ml.linalg.Vectors

val nil = Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)

df.na.fill(nil) // does not compile: na.fill has no overload that accepts a Vector

What is the correct way to do this?

Edit: Thanks to the answer, I found a way:

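import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{broadcast, coalesce}

val nil = Vectors.dense(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)

// `sc` must be a SparkSession (or SQLContext) here; a SparkContext has no `implicits`
import sc.implicits._
// One-row DataFrame that holds the replacement vector
val fill = Seq(Tuple1(nil)).toDF("replacement")

// Columns to fill: every column whose name contains "1"
val dates = data.schema.fieldNames.filter(e => e.contains("1"))

// `data` must be declared as a `var` for the reassignments below
data = data.crossJoin(broadcast(fill))
for (e <- dates) {
  data = data.withColumn(e, coalesce(data.col(e), $"replacement"))
}
data = data.drop("replacement")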
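
The loop above can also be written without reassigning a `var`, by folding over the column names. This is only a minimal sketch, not part of the original post: the object `FillNullVectors`, the helper `fillNullVectors`, the sample columns `f1` and `f21`, and the local SparkSession are all hypothetical names chosen for illustration. It assumes Spark 2.1 or later (for `crossJoin`) and that the input has no column already called `replacement`.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, coalesce, col}

object FillNullVectors {
  // Hypothetical helper: replaces nulls in each listed vector column with `replacement`.
  // Assumes `df` has no column named "replacement".
  def fillNullVectors(df: DataFrame, columns: Seq[String], replacement: Vector): DataFrame = {
    val spark = df.sparkSession
    import spark.implicits._
    // One-row DataFrame holding the replacement vector
    val fill = Seq(Tuple1(replacement)).toDF("replacement")
    columns
      .foldLeft(df.crossJoin(broadcast(fill))) { (acc, name) =>
        // Keep the existing value; fall back to the replacement when it is null
        acc.withColumn(name, coalesce(col(name), col("replacement")))
      }
      .drop("replacement")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("fill-null-vectors").getOrCreate()
    import spark.implicits._

    val ones = Vectors.dense(Array.fill(20)(1.0))
    // Made-up sample data: each row has one null vector
    val data = Seq(
      (1, Option.empty[Vector], Option(ones)),
      (2, Option(ones), Option.empty[Vector])
    ).toDF("id", "f1", "f21")

    // Same selection rule as in the question: column names containing "1"
    val dates = data.schema.fieldNames.filter(_.contains("1")).toSeq
    fillNullVectors(data, dates, ones).show(truncate = false)
    spark.stop()
  }
}
```

Because the fold returns a new DataFrame at each step, the input can stay a `val`. As in the loop, every selected column must be a vector column, otherwise `coalesce` fails on mismatched types.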

1 answer:

Answer 0 (score: 1)

If the nulls were introduced by adding some extra rows, you can join the data with a replacement row:

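import org.apache.spark.sql.functions._

// `nil` is the replacement vector from the question; the session's implicits must be imported for toDF and $
val df = Seq((1, None), (2, Some(nil))).toDF("id", "vector")
val fill = Seq(Tuple1(nil)).toDF("replacement")

// Attach the replacement to every row, use it wherever "vector" is null, then drop the helper column
df.crossJoin(broadcast(fill))
  .withColumn("vector", coalesce($"vector", $"replacement"))
  .drop("replacement")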
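
A different technique from the cross join, which the answer does not mention, is to produce the replacement from a zero-argument UDF and pass it straight to `coalesce`. The sketch below assumes Spark 2.x or later with `org.apache.spark.ml.linalg` vectors. The object `FillWithUdf`, the helper `fillWithUdf`, and the local SparkSession are hypothetical names used only for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, udf}

object FillWithUdf {
  // Hypothetical helper: fills nulls in one vector column without a join.
  def fillWithUdf(df: DataFrame, column: String, replacement: Vector): DataFrame = {
    // Zero-argument UDF that always returns the replacement vector
    val constant = udf(() => replacement)
    df.withColumn(column, coalesce(col(column), constant()))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("fill-with-udf").getOrCreate()
    import spark.implicits._

    val nil = Vectors.dense(Array.fill(20)(1.0))
    // Same shape as the example in the answer: one null vector, one present
    val df = Seq((1, Option.empty[Vector]), (2, Option(nil))).toDF("id", "vector")

    fillWithUdf(df, "vector", nil).show(truncate = false)
    spark.stop()
  }
}
```

This avoids the temporary `replacement` column, at the cost of an opaque UDF call in the query plan. Which variant performs better was not measured here.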