Can I divide the values of a DataFrame column by that column's maximum?
Spark SQL can divide a column's values by the column maximum using a subquery.
%sql
SELECT cumulativeSum / (SELECT max(cumulativeSum) FROM singularValueDF)
FROM singularValueDF
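For reference, the same statement can be issued from Scala by registering the DataFrame as a temporary view first. This is only a minimal sketch: the local SparkSession and the four running totals are made up for illustration and stand in for the real `cumulativeSum` column.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; a notebook usually provides `spark` already.
val spark = SparkSession.builder().master("local[1]").appName("coverage-sql").getOrCreate()
import spark.implicits._

// Hypothetical running totals, not the values from this question.
val singularValueDF = Seq(4.0, 7.0, 9.0, 10.0).toDF("cumulativeSum")
singularValueDF.createOrReplaceTempView("singularValueDF")

// The uncorrelated scalar subquery yields a single value that divides every row.
val coverageDF = spark.sql(
  """SELECT cumulativeSum,
    |       cumulativeSum / (SELECT max(cumulativeSum) FROM singularValueDF) AS coverage
    |FROM singularValueDF""".stripMargin)

coverageDF.show()
```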
I have several rows of singular values from an SVD.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = matrix.computeSVD(numFeatures, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
val singluarValues = s.toDense.values
val singularValueRDD = sc.parallelize(singluarValues)
singularValueRDD.toDF("singluar_value").show(5)
+------------------+
|    singluar_value|
+------------------+
|  323503.703778161|
|109669.14717327854|
|101621.48745300347|
| 93843.81264344015|
| 87209.07876311651|
...
I need the cumulative sum of the singular values, expressed as coverage.
coverage = cumulativeSum / max(cumulativeSum)
+------------------+-----------------+-------------------+
|    singluar_value|    cumulativeSum|           coverage|
+------------------+-----------------+-------------------+
|  323503.703778161| 323503.703778161| 0.0613375619450355|
|109669.14717327854|433172.8509514396| 0.0821312592957559|
|101621.48745300347|534794.3384044431|0.10139908902629156|
| 93843.81264344015|628638.1510478833|0.11919224132702236|
| 87209.07876311651|715847.2298109998|0.13572742224869208|
...
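To make the formula concrete, here is a minimal sketch in plain Scala collections, with no Spark involved. It uses only the five singular values printed above, so the running totals match the table, but the coverage does not: the table normalizes over the full vector, which is not shown here.

```scala
// First five singular values copied from the output above; the real vector is longer.
val sample: Array[Double] = Array(
  323503.703778161, 109669.14717327854, 101621.48745300347,
  93843.81264344015, 87209.07876311651)

// Running total: scanLeft emits the seed 0.0 first, so drop it with tail.
val cumulativeSum: Array[Double] = sample.scanLeft(0.0)(_ + _).tail

// The values are non-negative, so the last running total is also the maximum.
val total: Double = cumulativeSum.last
val coverage: Array[Double] = cumulativeSum.map(_ / total)

sample.indices.foreach { i =>
  println(f"${sample(i)}%.3f  ${cumulativeSum(i)}%.3f  ${coverage(i)}%.4f")
}
```

Because `svd.s` is already a local vector on the driver, computing the column this way before creating the DataFrame may be an option when the number of singular values is small.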
I tried to get this in a single pass using only the DataFrame API, but it did not work.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = matrix.computeSVD(numFeatures, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
val singluarValues = s.toDense.values
val windowSpec = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val coverageDF = sc.parallelize(singluarValues).toDF("singluar_value")
  .withColumn(
    "cumulativeSum",
    sum(col("singluar_value")).over(windowSpec)
  )
  .withColumn(
    "coverage",
    col("cumulativeSum") / max(col("cumulativeSum"))
  )
It fails with this error:
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and '`singluar_value`' is not an aggregate function. Wrap '((`cumulativeSum` / max(`cumulativeSum`)) AS `coverage`)' in windowing function(s) or wrap '`singluar_value`' in first() (or first_value) if you don't care which value you get.;;
Aggregate [singluar_value#8430, cumulativeSum#8433, (cumulativeSum#8433 / max(cumulativeSum#8433)) AS coverage#8437]
+- Project [singluar_value#8430, cumulativeSum#8433]
   +- Project [singluar_value#8430, cumulativeSum#8433, cumulativeSum#8433]
      +- Window [sum(singluar_value#8430) windowspecdefinition(singluar_value#8430 DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS cumulativeSum#8433], [singluar_value#8430 DESC NULLS LAST]
         +- Project [singluar_value#8430]
            +- Project [value#8428 AS singluar_value#8430]
               +- SerializeFromObject [input[0, double, false] AS value#8428]
                  +- ExternalRDD [obj#8427]
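The Aggregate node at the top of the plan shows what went wrong: `max(col("cumulativeSum"))` without `.over(...)` is a plain aggregate, so Spark treats the whole projection as a global aggregation and rejects the columns that are not aggregated. The message itself suggests wrapping the expression in a windowing function. A minimal sketch of that idea follows, assuming a local SparkSession and made-up values; `running` and `wholeFrame` are illustrative names, not from the original code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, max, sum}

// Local session for illustration only; a notebook usually provides `spark` already.
val spark = SparkSession.builder().master("local[1]").appName("coverage-window").getOrCreate()
import spark.implicits._

// Hypothetical singular values, not the ones from this question.
val df = Seq(4.0, 3.0, 2.0, 1.0).toDF("singluar_value")

// Running frame for the cumulative sum, as in the question.
val running = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Empty partition spec: a single window that covers every row.
val wholeFrame = Window.partitionBy()

val coverageDF = df
  .withColumn("cumulativeSum", sum(col("singluar_value")).over(running))
  // max(...).over(...) is a window expression, so no global aggregation is triggered.
  .withColumn("coverage", col("cumulativeSum") / max(col("cumulativeSum")).over(wholeFrame))

coverageDF.show()
```

Neither window is partitioned, so Spark moves all rows into a single partition to evaluate them. That is acceptable for a short vector of singular values but would not scale to a large table.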
Fetching the maximum first and then passing it to the literal (lit) function works, but it is too cumbersome.
val windowSpec = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val singularValueRDD = sc.parallelize(singluarValues)
val singularValueDF = singularValueRDD.toDF("singluar_value")
  .withColumn(
    "cumulativeSum",
    sum(col("singluar_value")).over(windowSpec)
  )
val total = singularValueDF.select(max(col("cumulativeSum"))).collect()(0).getDouble(0)
val coverageDF = singularValueDF
  .withColumn(
    "coverage",
    col("cumulativeSum") / lit(total)
  )
coverageDF.show(5)
Answer 0 (score: 0)
val windowSpec = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpecAll = Window
  .orderBy(desc("singluar_value"))
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val coverageDF = sc.parallelize(singluarValues).toDF("singluar_value")
  .withColumn(
    "id",
    row_number().over(windowSpec)
  )
  .select("id", "singluar_value")
  .withColumn(
    "eignvalue",
    pow(col("singluar_value"), 2) / lit(numSamples - 1)
  )
  .withColumn(
    "cumulativeSum",
    sum(col("eignvalue")).over(windowSpec)
  )
  .withColumn(
    "coverage",
    col("cumulativeSum") / last(col("cumulativeSum")).over(windowSpecAll)
  )
coverageDF.show(5)
+---+------------------+------------------+------------------+-------------------+
| id|    singluar_value|         eignvalue|     cumulativeSum|           coverage|
+---+------------------+------------------+------------------+-------------------+
|  1|  323503.703778161| 2491836.623685996| 2491836.623685996|0.43500390900934977|
|  2|109669.14717327854| 286371.6241271037|   2778208.2478131|0.48499626193511053|
|  3|101621.48745300347|245885.06183863763|3024093.3096517376| 0.5279208108602296|
|  4| 93843.81264344015|209687.40140139285|  3233780.71105313| 0.5645262762477199|
|  5| 87209.07876311651|181085.82153649992|  3414866.53258963| 0.5961387180449769|
+---+------------------+------------------+------------------+-------------------+
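Note that this coverage is computed from squared singular values scaled by `numSamples - 1`, not from the singular values themselves, which is why the numbers differ from the table in the question. If the input matrix was mean-centered (an assumption, not stated here), that scaling gives the eigenvalues of the sample covariance matrix. The constant factor cancels in the ratio, as the plain-Scala sketch below illustrates. It uses only the five singular values shown and a hypothetical `numSamples`, so its output will not match the table above.

```scala
// First five singular values from the output above; the real vector is longer.
val sample: Array[Double] = Array(
  323503.703778161, 109669.14717327854, 101621.48745300347,
  93843.81264344015, 87209.07876311651)

// Hypothetical sample count, chosen only for illustration.
val numSamples = 1000

// Cumulative share of the total for a sequence of non-negative weights.
def coverage(weights: Array[Double]): Array[Double] = {
  val running = weights.scanLeft(0.0)(_ + _).tail
  running.map(_ / running.last)
}

val eigenvalues = sample.map(s => s * s / (numSamples - 1))
val fromEigen = coverage(eigenvalues)
val fromSquares = coverage(sample.map(s => s * s))

// Both columns agree up to floating-point rounding.
fromEigen.zip(fromSquares).foreach { case (a, b) => println(f"$a%.6f  $b%.6f") }
```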