Creating an ArrayType[StructType] column from a DataFrame inside a UDF

Asked: 2019-04-10 16:40:09

Tags: scala apache-spark apache-spark-sql apache-spark-ml

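I have created a BucketedRandomProjectionLSHModel in order to find the approximate nearest neighbours of every row in a dataset. The signature of approxNearestNeighbors is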

def approxNearestNeighbors(
      dataset: Dataset[_],
      key: Vector,
      numNearestNeighbors: Int): Dataset[_] 

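To run this on every row of a DataFrame, my idea is to create a UDF that calls this function and converts the resulting Dataset into a column of type ArrayType[StructType].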

Suppose my initial schema is

root
 |-- genderIndex: double (nullable = false)
 |-- genderIndexVec: vector (nullable = true)
 |-- categoryIndex: double (nullable = false)
 |-- categoryIndexVec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- featureStdDev: vector (nullable = true)

My target schema (after calling .withColumn("neighbours", udf ...)) is

root
 |-- genderIndex: double (nullable = false)
 |-- genderIndexVec: vector (nullable = true)
 |-- categoryIndex: double (nullable = false)
 |-- categoryIndexVec: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- featureStdDev: vector (nullable = true)
 |-- neighbours: array (nullable = true)
      |-- elem: struct
           |-- genderIndex: double (nullable = false)
           |-- genderIndexVec: vector (nullable = true)
           |-- categoryIndex: double (nullable = false)
           |-- categoryIndexVec: vector (nullable = true)
           |-- features: vector (nullable = true)
           |-- featureStdDev: vector (nullable = true)
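
For reference, the target schema above can also be written down with Spark's type API. This is only a sketch: the object name `TargetSchema` and the values `rowSchema` and `targetSchema` are names chosen here for illustration, and `containsNull = true` on the array is an assumption, since the schema above does not say whether elements may be null.

```scala
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

object TargetSchema {
  // Fields of a single row, copied from the initial schema.
  val rowSchema: StructType = StructType(Seq(
    StructField("genderIndex", DoubleType, nullable = false),
    StructField("genderIndexVec", VectorType, nullable = true),
    StructField("categoryIndex", DoubleType, nullable = false),
    StructField("categoryIndexVec", VectorType, nullable = true),
    StructField("features", VectorType, nullable = true),
    StructField("featureStdDev", VectorType, nullable = true)
  ))

  // The same row plus an array of structs, each with the row's own fields.
  // containsNull = true is an assumption, not something stated in the question.
  val targetSchema: StructType = rowSchema.add(
    StructField("neighbours", ArrayType(rowSchema, containsNull = true), nullable = true)
  )
}
```

Calling `TargetSchema.targetSchema.printTreeString()` prints the tree in the same form that `printSchema()` uses, which makes it easy to compare against a result DataFrame.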

Please help me with the UDF, as I am not sure how to make it work.

val model = // BucketedRandomProjectionLSHModel definition
val inputDF = // Input definition
val nn = udf{ (featureVector: SparseVector, k: Int) =>
      model.approxNearestNeighbors(inputDF, featureVector, k)
      // What now...
    }
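
One thing worth noting about the snippet above: a UDF body runs on the executors, where `inputDF` and the session behind it are not usable, so calling `model.approxNearestNeighbors` from inside a UDF is not expected to work. A possible alternative is to let the model do a self-join with `approxSimilarityJoin` and then collect the matches per row. The sketch below is one way this could look, under several assumptions: the DataFrame has a unique id column (the schema in the question has none, so one would have to be added first, for example with `monotonically_increasing_id()`), `maxDistance` is a threshold you pick yourself, and `NeighbourColumn` and `withNeighbours` are hypothetical names.

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, row_number, struct}

object NeighbourColumn {
  // Hypothetical helper: adds a "neighbours" array-of-struct column to df.
  // idCol must name a column that uniquely identifies each row (an assumption).
  def withNeighbours(
      model: BucketedRandomProjectionLSHModel,
      df: DataFrame,
      idCol: String,
      k: Int,
      maxDistance: Double): DataFrame = {
    // Self-join: candidate pairs within maxDistance, excluding a row paired with itself.
    val pairs = model
      .approxSimilarityJoin(df, df, maxDistance, "distance")
      .toDF()
      .filter(col(s"datasetA.$idCol") =!= col(s"datasetB.$idCol"))

    // Rebuild the neighbour struct from df's own columns only, so a hash
    // column added by the model during the join is left out.
    val neighbour = struct(df.columns.map(c => col(s"datasetB.$c")): _*)

    // Rank each row's candidates by distance and keep the k closest.
    val byDistance = Window
      .partitionBy(col(s"datasetA.$idCol"))
      .orderBy(col("distance"))

    val neighbours = pairs
      .withColumn("rank", row_number().over(byDistance))
      .filter(col("rank") <= k)
      .groupBy(col(s"datasetA.$idCol").as(idCol))
      .agg(collect_list(neighbour).as("neighbours"))

    // Left join so rows without any match within maxDistance are kept.
    df.join(neighbours, Seq(idCol), "left")
  }
}
```

Usage would be along the lines of `NeighbourColumn.withNeighbours(model, inputDF, "id", k, maxDistance)`. Two caveats: the join is approximate, so a row can end up with fewer than `k` neighbours (or a null array), and `collect_list` does not guarantee that the array is ordered by distance.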

0 answers:

No answers yet