Row-wise average in Spark Scala, handling nulls

Date: 2018-11-07 13:00:42

Tags: scala apache-spark

I have a DataFrame with a large amount of data and "n" columns.

df_avg_calc: org.apache.spark.sql.DataFrame = [col1: double, col2: double ... 4 more fields]
+------------------+-----------------+------------------+-----------------+-----+-----+
|              col1|             col2|              col3|             col4| col5| col6|
+------------------+-----------------+------------------+-----------------+-----+-----+
|              null|             null|              null|             null| null| null|
|              14.0|              5.0|              73.0|             null| null| null|
|              null|             null|             28.25|             null| null| null|
|              null|             null|              null|             null| null| null|
|33.723333333333336|59.78999999999999|39.474999999999994|82.09666666666666|101.0|53.43|
|             26.25|             null|              null|              2.0| null| null|
|              null|             null|              null|             null| null| null|
|             54.46|           89.475|              null|             null| null| null|
|              null|            12.39|              null|             null| null| null|
|              null|             58.0|             19.45|              1.0| 1.33|158.0|
+------------------+-----------------+------------------+-----------------+-----+-----+

I need to compute a row-wise average, where cells that are null are not considered in the average.

This needs to be implemented in Spark/Scala. I have tried to illustrate the expected result in the attached image:

(image: row-wise average)
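For instance, for the second data row above only three cells are non-null, so the expected row average is (14.0 + 5.0 + 73.0) / 3 ≈ 30.67, not (14.0 + 5.0 + 73.0) / 6.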

What I have tried so far:

By referring to - Calculate row mean, ignoring NAs in Spark Scala

val df = df_raw.schema.fieldNames.filter(f => f.contains("colname"))
val rowMeans = df_raw.select(df.map(f => col(f)).reduce(_ + _) / lit(df.length) as "row_mean")

df_raw contains the columns that need to be aggregated (row-wise, of course). There are 80+ columns. They contain data and nulls arbitrarily, and when computing the average the null cells must not be counted in the denominator. The code above works fine when all columns contain data, but a single null in any column makes the result null.
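For reference, the reason the snippet above turns whole rows into null is that arithmetic on Columns is null-propagating, and lit(df.length) always divides by the total number of columns. A minimal sketch of the null propagation:

import org.apache.spark.sql.functions._

// Adding a null operand nulls out the whole expression,
// so a single missing value poisons the row sum.
spark.range(1)
  .select((lit(null).cast("double") + lit(1.0)).as("s"))
  .show()
// +----+
// |   s|
// +----+
// |null|
// +----+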

EDIT

I tried adapting this answer by Terry Dactyl:

def average(l: Seq[Double]): Option[Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}

val avgUdf = udf(average(_: Seq[Double]))

val rowAvgDF = df_avg_calc.select(avgUdf(array($"col1",$"col2",$"col3",$"col4",$"col5",$"col6")).as("row_avg"))
rowAvgDF.show(10,false)

rowAvgDF: org.apache.spark.sql.DataFrame = [row_avg: double]
+------------------+
|row_avg           |
+------------------+
|0.0               |
|15.333333333333334|
|4.708333333333333 |
|0.0               |
|61.58583333333333 |
|4.708333333333333 |
|0.0               |
|23.989166666666666|
|2.065             |
|39.63             |
+------------------+
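A likely explanation for the zeros above (an inference from the output, not something Spark reports): the UDF parameter is Seq[Double], a primitive element type, so the nulls in the array are silently unboxed to 0.0 before the UDF runs, Option(i) finds nothing to filter out, and every row ends up divided by 6. The plain-Scala behaviour behind this:

null.asInstanceOf[Double]            // 0.0  -- primitive, the null is lost
null.asInstanceOf[java.lang.Double]  // null -- boxed, the null survives

Declaring the parameter as Seq[java.lang.Double] (as in the second answer below) keeps the nulls visible so they can be excluded.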

2 Answers:

Answer 0 (score: 0)

Spark >= 2.4

You can use aggregate:

val row_mean = expr("""aggregate(
  CAST(array(_1, _2, _3) AS array<double>), 
  -- Initial value
  -- Note that aggregate is picky about types
  CAST((0.0 as sum, 0.0 as n) AS struct<sum: double, n: double>), 
  -- Merge function
  (acc, x) -> (
    acc.sum + coalesce(x, 0.0), 
    acc.n + CASE WHEN x IS NULL THEN 0.0 ELSE 1.0 END), 
  -- Finalize function
  acc -> acc.sum / acc.n)""")

Usage:

df.withColumn("row_mean", row_mean).show

Result:

+----+----+----+--------+
|  _1|  _2|  _3|row_mean|
+----+----+----+--------+
|null|null|null|    null|
| 2.0|null|null|     2.0|
|50.0|34.0|null|    42.0|
| 1.0| 2.0| 3.0|     2.0|
+----+----+----+--------+
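On Spark 3.0+ the same higher-order aggregate is also exposed through the DataFrame API, which avoids embedding SQL in a string. A sketch of an equivalent expression (assuming Spark 3.0 or later):

import org.apache.spark.sql.functions._

val row_mean_api = aggregate(
  // Cast so the sum/count arithmetic is done on doubles
  array($"_1", $"_2", $"_3").cast("array<double>"),
  // Accumulator: running sum and count of non-null values
  struct(lit(0.0).as("sum"), lit(0.0).as("n")),
  (acc, x) => struct(
    (acc.getField("sum") + coalesce(x, lit(0.0))).as("sum"),
    (acc.getField("n") + when(x.isNull, 0.0).otherwise(1.0)).as("n")),
  // Finish: divide; for an all-null row this is 0.0 / 0.0, i.e. null
  acc => acc.getField("sum") / acc.getField("n"))

df.withColumn("row_mean", row_mean_api).show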

Version independent

Compute the sum and the count of the NOT NULL columns and divide one by the other:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def row_mean(cols: Column*) = {
  // Sum of values ignoring nulls
  val sum = cols
    .map(c => coalesce(c, lit(0)))
    .foldLeft(lit(0))(_ + _)
  // Count of not null values
  val cnt = cols
    .map(c => when(c.isNull, 0).otherwise(1))
    .foldLeft(lit(0))(_ + _)
  sum / cnt
}

Sample data:

val df = Seq(
  (None, None, None), 
  (Some(2.0), None, None),
  (Some(50.0), Some(34.0), None),
  (Some(1.0), Some(2.0), Some(3.0))
).toDF

Result:

df.withColumn("row_mean", row_mean($"_1", $"_2", $"_3")).show
+----+----+----+--------+
|  _1|  _2|  _3|row_mean|
+----+----+----+--------+
|null|null|null|    null|
| 2.0|null|null|     2.0|
|50.0|34.0|null|    42.0|
| 1.0| 2.0| 3.0|     2.0|
+----+----+----+--------+
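Since the question mentions 80+ columns, the helper can be applied without typing every name, for example (a sketch; the "col" prefix is just a placeholder for whatever selects the relevant columns of df_raw):

val targetCols = df_raw.columns.filter(_.startsWith("col")).map(col)
df_raw.withColumn("row_mean", row_mean(targetCols: _*)).show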

Answer 1 (score: -1)

def average(l: Seq[Integer]): Option[Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}

val avgUdf = udf(average(_: Seq[Integer]))

val df = List((Some(1),Some(2)), (Some(1), None), (None, None)).toDF("a", "b")

val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))

avgDf.collect

res0: Array[org.apache.spark.sql.Row] = Array([1.5], [1.0], [null])

Testing on the data you supplied (using the same avgUdf select as above) gives the correct result:

val df = List(
  (Some(10),Some(5), Some(5), None, None),
  (None, Some(5), Some(5), None, Some(5)),
  (Some(2), Some(8), Some(5), Some(1), Some(2)), 
  (None, None, None, None, None)
).toDF("col1", "col2", "col3", "col4", "col5")

Array[org.apache.spark.sql.Row] = Array([6.666666666666667], [5.0], [3.6], [null])

Note that if there are columns you do not want to include, make sure they are filtered out when populating the array passed to the UDF.
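For example (a sketch; the "col" prefix is a hypothetical way to pick out the relevant columns):

val avgCols = df.schema.fieldNames.filter(_.startsWith("col")).map(col)
val avgDf = df.select(avgUdf(array(avgCols: _*)).as("average"))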

And finally:

val df = List(
  (Some(14), Some(5), Some(73), None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")

Array[org.apache.spark.sql.Row] = Array([30.666666666666668])

Again, the correct result.

If you want to work with Doubles...

def average(l: Seq[java.lang.Double]): Option[java.lang.Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if(nonNull.isEmpty) None else Some(nonNull.reduce(_ + _) / nonNull.size.toDouble)
}

val avgUdf = udf(average(_: Seq[java.lang.Double]))

val df = List(
  (Some(14.0), Some(5.0), Some(73.0), None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")

val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))

avgDf.collect

Array[org.apache.spark.sql.Row] = Array([30.666666666666668])