I have a DataFrame with a SparseVector features column. I need to scale each row's vector by a scalar. I have a working UDF-based implementation below. The following shows the original features column and the scaled features column:
+-------------------+-------+-------------------+
| features|weights| scaledFeatures|
+-------------------+-------+-------------------+
|(6,[0,1],[0.5,1.0])| 1.0|(6,[0,1],[0.5,1.0])|
|(6,[2,3],[1.5,2.0])| 2.0|(6,[2,3],[3.0,4.0])|
|(6,[4,5],[0.5,1.0])| 3.0|(6,[4,5],[1.5,3.0])|
+-------------------+-------+-------------------+
Is there a way to do this with Spark's native, optimized operations instead of a UDF?
Similarly, is there a Spark-native way to scale a SparseVector by a scalar? See the lines under the "Scale the SparseVector" comment in the UDF defined below.
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
// Scaling a SparseVector column
val data = Array(
  (new SparseVector(6, Array(0, 1), Array(0.5, 1.0)), 1.0),
  (new SparseVector(6, Array(2, 3), Array(1.5, 2.0)), 2.0),
  (new SparseVector(6, Array(4, 5), Array(0.5, 1.0)), 3.0)
)
val df = spark.createDataFrame(data).toDF("features", "weights")

val scaleUDF = udf((sv: SparseVector, w: Double) => {
  // Scale the SparseVector: multiply each stored value by the scalar;
  // the indices (the sparsity pattern) stay unchanged
  val (indices, values) = sv.indices.zip(sv.values).map { case (i, v) => (i, v * w) }.unzip
  new SparseVector(sv.size, indices, values)
})
val scaledDF = df.withColumn("scaledFeatures", scaleUDF(col("features"), col("weights")))
scaledDF.show()
+-------------------+-------+-------------------+
| features|weights| scaledFeatures|
+-------------------+-------+-------------------+
|(6,[0,1],[0.5,1.0])| 1.0|(6,[0,1],[0.5,1.0])|
|(6,[2,3],[1.5,2.0])| 2.0|(6,[2,3],[3.0,4.0])|
|(6,[4,5],[0.5,1.0])| 3.0|(6,[4,5],[1.5,3.0])|
+-------------------+-------+-------------------+
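One UDF-free route I'm aware of, sketched below under the assumption of Spark 3.1+ (vector_to_array arrived in 3.0, array_to_vector in 3.1): convert the vector column to an array with org.apache.spark.ml.functions.vector_to_array, scale each element with the built-in higher-order function transform, and convert back with array_to_vector. The catch is that vector_to_array produces a dense array, so the result comes back as a DenseVector and the sparse representation is lost:

import org.apache.spark.ml.functions.{array_to_vector, vector_to_array}
import org.apache.spark.sql.functions.{col, transform}

// Vector -> array -> scale -> vector, using only built-in functions.
// Note: vector_to_array densifies, so scaledFeatures comes back dense.
val scaledNative = df.withColumn(
  "scaledFeatures",
  array_to_vector(transform(vector_to_array(col("features")), _ * col("weights")))
)
scaledNative.show(truncate = false)

As for scaling a standalone SparseVector by a scalar: since multiplying by a scalar leaves the indices untouched, mapping over the values alone is enough, e.g. new SparseVector(sv.size, sv.indices, sv.values.map(_ * w)). As far as I know, the ml.linalg API does not publicly expose a scale operation (its BLAS helpers are private to Spark), so a small piece of user code like this seems hard to avoid if you need to keep the sparse representation.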