Scaling a SparseVector column without a UDF

Asked: 2018-11-30 18:01:20

Tags: scala apache-spark

I have a DataFrame with a SparseVector features column, and I need to scale each row's vector by a scalar from another column. I have a working implementation below that uses a UDF. Here are the original features column and the scaled features column side by side:

+-------------------+-------+-------------------+
|           features|weights|     scaledFeatures|
+-------------------+-------+-------------------+
|(6,[0,1],[0.5,1.0])|    1.0|(6,[0,1],[0.5,1.0])|
|(6,[2,3],[1.5,2.0])|    2.0|(6,[2,3],[3.0,4.0])|
|(6,[4,5],[0.5,1.0])|    3.0|(6,[4,5],[1.5,3.0])|
+-------------------+-------+-------------------+

Is there a way to do this with Spark's native, optimized methods instead of a UDF?

Similarly, is there a Spark-native way to scale a single SparseVector by a scalar? See the line below the "Scale the SparseVector" comment in the UDF defined below.

import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Scaling a SparseVector column
val data = Array(
  (new SparseVector(6, Array(0,1), Array(0.5, 1.0)), 1.0),
  (new SparseVector(6, Array(2,3), Array(1.5, 2.0)), 2.0),
  (new SparseVector(6, Array(4,5), Array(0.5, 1.0)), 3.0)
)

val df = spark.createDataFrame(data).toDF("features", "weights")

val scaleUDF = udf((sv: SparseVector, w: Double) => {
  // Scale the SparseVector
  val unzipped = sv.indices.zip(sv.values).map { case (i, v) => (i, v * w) }.unzip
  new SparseVector(sv.size, unzipped._1, unzipped._2)
})

val scaledDF = df.withColumn("scaledFeatures", scaleUDF(col("features"), col("weights")))
scaledDF.show()
+-------------------+-------+-------------------+
|           features|weights|     scaledFeatures|
+-------------------+-------+-------------------+
|(6,[0,1],[0.5,1.0])|    1.0|(6,[0,1],[0.5,1.0])|
|(6,[2,3],[1.5,2.0])|    2.0|(6,[2,3],[3.0,4.0])|
|(6,[4,5],[0.5,1.0])|    3.0|(6,[4,5],[1.5,3.0])|
+-------------------+-------+-------------------+
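On the first question: as far as I can tell there is no built-in transformer for per-row scalar scaling (ElementwiseProduct multiplies every row by one fixed scaling vector, not a per-row weight). One UDF-free sketch, assuming Spark 3.1+ where ml.functions.vector_to_array / array_to_vector and the higher-order transform function are available, is to round-trip through an array column:

import org.apache.spark.ml.functions.{array_to_vector, vector_to_array}
import org.apache.spark.sql.functions.{col, transform}

// UDF-free sketch (assumes Spark 3.1+): expand the vector to an array,
// scale every element with the higher-order `transform` function, then
// convert the result back to a vector.
val nativeDF = df.withColumn(
  "scaledFeatures",
  array_to_vector(
    transform(vector_to_array(col("features")), x => x * col("weights"))
  )
)

The values come out the same as with the UDF, but the round trip densifies the data: vector_to_array expands the sparse vector and array_to_vector returns a dense vector, so scaledFeatures prints in dense notation and this only pays off for short vectors.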

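On the second question: for a single SparseVector I don't see a public native call either; the BLAS helpers in org.apache.spark.ml.linalg are private[spark]. Since scalar multiplication leaves the indices untouched, mapping over the values array seems the most direct option. A minimal local sketch (the scale helper name is mine, not a Spark API):

import org.apache.spark.ml.linalg.SparseVector

// Hypothetical helper, not a Spark API: the indices are unchanged by
// scalar multiplication, so only the values array needs to be mapped.
def scale(sv: SparseVector, w: Double): SparseVector =
  new SparseVector(sv.size, sv.indices, sv.values.map(_ * w))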
0 Answers

No answers yet