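Spark DataFrame: how to divide column values by the column's maximum value

Time: 2019-12-25 03:18:27

Tags: apache-spark-sql

Question

Can I divide the values of a DataFrame column by that column's maximum value?

In Spark SQL this can be done with a subquery that computes the maximum.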

%sql
SELECT cumulativeSum / (SELECT max(cumulativeSum) FROM singularValueDF) 
FROM singularValueDF
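For reference, the same subquery can be run from Scala by registering the DataFrame as a temporary view. This is a minimal sketch, assuming a local SparkSession and made-up cumulative sums; the object name `CoverageWithSubquery` and the helper `withCoverage` are hypothetical and not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CoverageWithSubquery {
  // Registers the cumulative sums as the view "singularValueDF" and divides
  // each value by the maximum returned by an uncorrelated scalar subquery.
  def withCoverage(spark: SparkSession, cumulativeSums: Seq[Double]): DataFrame = {
    import spark.implicits._
    cumulativeSums.toDF("cumulativeSum").createOrReplaceTempView("singularValueDF")
    spark.sql(
      """SELECT cumulativeSum,
        |       cumulativeSum / (SELECT max(cumulativeSum) FROM singularValueDF) AS coverage
        |FROM singularValueDF""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a notebook already provides `spark`.
    val spark = SparkSession.builder().master("local[1]").appName("subquery-sketch").getOrCreate()
    // Hypothetical values; in the question they come from a window sum.
    withCoverage(spark, Seq(4.0, 7.0, 9.0, 10.0)).show()
    spark.stop()
  }
}
```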

Background

I have the singular values from an SVD, one per row.

val svd: SingularValueDecomposition[RowMatrix, Matrix] = matrix.computeSVD(numFeatures, computeU = true)
val U: RowMatrix = svd.U  // The U factor is a RowMatrix.
val s: Vector = svd.s     // The singular values are stored in a local dense vector.
val V: Matrix = svd.V     // The V factor is a local dense matrix.


val singluarValues = s.toDense.values
val singularValueRDD = sc.parallelize(singluarValues)
singularValueRDD.toDF("singluar_value").show(5)

+------------------+
|    singluar_value|
+------------------+
|  323503.703778161|
|109669.14717327854|
|101621.48745300347|
| 93843.81264344015|
| 87209.07876311651|
...

I need the cumulative sum of the singular values and, from it, the coverage.

coverage = cumulativeSum / max(cumulativeSum)
+------------------+-----------------+-------------------+
|    singluar_value|    cumulativeSum|           coverage|
+------------------+-----------------+-------------------+
|  323503.703778161| 323503.703778161| 0.0613375619450355|
|109669.14717327854|433172.8509514396| 0.0821312592957559|
|101621.48745300347|534794.3384044431|0.10139908902629156|
| 93843.81264344015|628638.1510478833|0.11919224132702236|
| 87209.07876311651|715847.2298109998|0.13572742224869208| 
...
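To make the target quantity concrete, here is a plain-Scala sketch (no Spark) of the same arithmetic: a running total followed by division by the largest running total. The object name `CoverageSketch` and the toy input are assumptions for illustration only.

```scala
object CoverageSketch {
  // Running totals of the input, in order.
  // scanLeft emits the seed 0.0 first, so the leading element is dropped.
  def cumulativeSum(values: Seq[Double]): Seq[Double] =
    values.scanLeft(0.0)(_ + _).tail

  // Each running total divided by the largest running total.
  def coverage(values: Seq[Double]): Seq[Double] = {
    val sums = cumulativeSum(values)
    if (sums.isEmpty) Seq.empty
    else {
      val total = sums.max
      sums.map(_ / total)
    }
  }

  def main(args: Array[String]): Unit = {
    val toy = Seq(4.0, 3.0, 2.0, 1.0) // hypothetical values, largest first
    println(cumulativeSum(toy))       // running totals 4, 7, 9, 10
    println(coverage(toy))            // each total divided by 10
  }
}
```

For non-negative inputs such as singular values, the largest running total is the last one, so the final coverage value is 1.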

Attempt

I tried to do this in a single chain of DataFrame operations, but it did not work.

val svd: SingularValueDecomposition[RowMatrix, Matrix] = matrix.computeSVD(numFeatures, computeU = true)
val U: RowMatrix = svd.U  // The U factor is a RowMatrix.
val s: Vector = svd.s     // The singular values are stored in a local dense vector.
val V: Matrix = svd.V     // The V factor is a local dense matrix.


val singluarValues = s.toDense.values
val windowSpec = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val coverageDF = sc.parallelize(singluarValues).toDF("singluar_value")
    .withColumn(
        "cumulativeSum", 
        sum(col("singluar_value")).over(windowSpec)
    )
    .withColumn(
        "coverage", 
        col("cumulativeSum") / max(col("cumulativeSum"))
    )

This raises the following error.

org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`singluar_value`' is not an aggregate function. Wrap '((`cumulativeSum` / max(`cumulativeSum`)) AS `coverage`)' in windowing function(s) or wrap '`singluar_value`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [singluar_value#8430, cumulativeSum#8433, (cumulativeSum#8433 / max(cumulativeSum#8433)) AS coverage#8437]
+- Project [singluar_value#8430, cumulativeSum#8433]
   +- Project [singluar_value#8430, cumulativeSum#8433, cumulativeSum#8433]
      +- Window [sum(singluar_value#8430) windowspecdefinition(singluar_value#8430 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS cumulativeSum#8433], [singluar_value#8430 DESC NULLS LAST]
         +- Project [singluar_value#8430]
            +- Project [value#8428 AS singluar_value#8430]
               +- SerializeFromObject [input[0, double, false] AS value#8428]
                  +- ExternalRDD [obj#8427]

Workaround

Collect the maximum first, then pass it back in with the literal (lit) function. This works, but it is cumbersome.

val windowSpec = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

val singularValueRDD = sc.parallelize(singluarValues)
val singularValueDF = singularValueRDD.toDF("singluar_value")
    .withColumn(
        "cumulativeSum", 
        sum(col("singluar_value")).over(windowSpec)
    )
val total = singularValueDF.select(max(col("cumulativeSum"))).collect()(0).getDouble(0)
val coverageDF = singularValueDF
    .withColumn(
        "coverage", 
        col("cumulativeSum") / lit(total)
    )

coverageDF.show(5)

1 answer:

Answer 0 (score: 0)

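import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Running frame: from the first row up to the current row, largest value first.
val windowSpec = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Full frame: every row, so last() returns the final cumulative sum.
val windowSpecAll = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)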

val coverageDF = sc.parallelize(singluarValues).toDF("singluar_value")
    .withColumn(
        "id",
        row_number().over(windowSpec)
    )
    .select("id", "singluar_value")
    .withColumn(
        "eignvalue", 
        // numSamples is not defined in this snippet; presumably the number of rows in the input matrix.
        pow(col("singluar_value"), 2) / lit(numSamples - 1)
    )
    .withColumn(
        "cumulativeSum", 
        sum(col("eignvalue")).over(windowSpec)
    )
    .withColumn(
        "coverage", 
        col("cumulativeSum") / last(col("cumulativeSum")).over(windowSpecAll)
    )
coverageDF.show(5)

+---+------------------+------------------+------------------+-------------------+
| id|    singluar_value|         eignvalue|     cumulativeSum|           coverage|
+---+------------------+------------------+------------------+-------------------+
|  1|  323503.703778161| 2491836.623685996| 2491836.623685996|0.43500390900934977|
|  2|109669.14717327854| 286371.6241271037|   2778208.2478131|0.48499626193511053|
|  3|101621.48745300347|245885.06183863763|3024093.3096517376| 0.5279208108602296|
|  4| 93843.81264344015|209687.40140139285|  3233780.71105313| 0.5645262762477199|
|  5| 87209.07876311651|181085.82153649992|  3414866.53258963| 0.5961387180449769|
+---+------------------+------------------+------------------+-------------------+
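The attempt in the question fails because a bare `max(...)` inside `withColumn` is a global aggregate, and Spark then rejects the non-aggregated columns next to it. The answer avoids this by turning the aggregate into a window function. A shorter variant of the same idea, sketched below, applies `max` over an empty window with `Column.over()`, so no second window spec is needed. The object name `CoverageWithWindowMax`, the helper `coverageDF`, and the toy values are hypothetical; unlike the answer, this sketch works on the raw singular values rather than on eigenvalues.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, max, sum}

object CoverageWithWindowMax {
  def coverageDF(spark: SparkSession, values: Seq[Double]): DataFrame = {
    import spark.implicits._

    // Running frame: first row up to the current row, largest value first.
    val running = Window
      .orderBy(desc("singluar_value"))
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    values.toDF("singluar_value")
      .withColumn("cumulativeSum", sum(col("singluar_value")).over(running))
      // over() with no arguments is an empty window: the max is taken over all rows.
      .withColumn("coverage", col("cumulativeSum") / max(col("cumulativeSum")).over())
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a notebook already provides `spark`.
    val spark = SparkSession.builder().master("local[1]").appName("coverage-sketch").getOrCreate()
    coverageDF(spark, Seq(4.0, 3.0, 2.0, 1.0)).show() // hypothetical toy values
    spark.stop()
  }
}
```

Like the windows in the answer, this one has no partitioning, so Spark moves all rows to a single partition. That is acceptable for a short list of singular values but would not scale to a large table.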