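I created a BucketedRandomProjectionLSHModel so that I can find the approximate nearest neighbours of every row in my dataset. The signature of approxNearestNeighbors is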
def approxNearestNeighbors(
    dataset: Dataset[_],
    key: Vector,
    numNearestNeighbors: Int): Dataset[_]
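For reference, here is a minimal sketch of how this method is called for a single key. The toy data, the column names `id`, `features` and `hashes`, and the `bucketLength` and `numHashTables` values are illustrative assumptions, not taken from my real pipeline.

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SingleKeyNeighbours {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lsh-single-key")
      .getOrCreate()

    // Hypothetical toy dataset standing in for the real input
    val df = spark.createDataFrame(Seq(
      (0L, Vectors.dense(1.0, 1.0)),
      (1L, Vectors.dense(1.0, -1.0)),
      (2L, Vectors.dense(-1.0, -1.0)),
      (3L, Vectors.dense(-1.0, 1.0))
    )).toDF("id", "features")

    // Parameter values are assumptions chosen for the toy data
    val model = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")
      .fit(df)

    // One key vector in, one Dataset of (at most) 2 candidate rows out;
    // the result carries an extra distance column named "distCol" by default
    val key = Vectors.dense(1.0, 0.0)
    model.approxNearestNeighbors(df, key, 2).show(false)

    spark.stop()
  }
}
```

The point to note is that each call takes one key and returns a whole Dataset, which is what makes it awkward to apply row by row.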
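To run it on every row of a DataFrame, my idea is to write a UDF that calls this function and converts the resulting Dataset into a column of type ArrayType[StructType].
Suppose my initial schema is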
root
|-- genderIndex: double (nullable = false)
|-- genderIndexVec: vector (nullable = true)
|-- categoryIndex: double (nullable = false)
|-- categoryIndexVec: vector (nullable = true)
|-- features: vector (nullable = true)
|-- featureStdDev: vector (nullable = true)
My target schema (after calling .withColumn($"featureStdDev", udf ...)) is
root
|-- genderIndex: double (nullable = false)
|-- genderIndexVec: vector (nullable = true)
|-- categoryIndex: double (nullable = false)
|-- categoryIndexVec: vector (nullable = true)
|-- features: vector (nullable = true)
|-- featureStdDev: vector (nullable = true)
|-- neighbours: array (nullable = true)
|    |-- elem: struct
|    |    |-- genderIndex: double (nullable = false)
|    |    |-- genderIndexVec: vector (nullable = true)
|    |    |-- categoryIndex: double (nullable = false)
|    |    |-- categoryIndexVec: vector (nullable = true)
|    |    |-- features: vector (nullable = true)
|    |    |-- featureStdDev: vector (nullable = true)
Please help me with the UDF, as I am not sure how to make it work.
val model = // BucketedRandomProjectionLSHModel definition
val inputDF = // Input definition
val nn = udf { (featureVector: SparseVector, k: Int) =>
  model.approxNearestNeighbors(inputDF, featureVector, k)
  // What now...
}
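One likely obstacle with the attempt above is that a UDF body runs on the executors, where a Dataset such as `inputDF` cannot be queried, so the call inside the UDF is not expected to work as written. A possible alternative, sketched here under assumptions, is to self-join with `approxSimilarityJoin` and then collect each row's matches into an array. The `id` column, the toy data, the distance `threshold`, and the LSH parameters are all hypothetical. Note that this swaps in a threshold-based join for the k-nearest-neighbour call, so a row may end up with fewer than `k` neighbours.

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, row_number, struct}

object NeighboursPerRow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lsh-neighbours")
      .getOrCreate()

    // Hypothetical toy input; "id" is an assumed unique row key
    val inputDF = spark.createDataFrame(Seq(
      (0L, Vectors.dense(1.0, 1.0)),
      (1L, Vectors.dense(1.0, -1.0)),
      (2L, Vectors.dense(-1.0, -1.0)),
      (3L, Vectors.dense(-1.0, 1.0))
    )).toDF("id", "features")

    val model = new BucketedRandomProjectionLSH()
      .setBucketLength(2.0)
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")
      .fit(inputDF)

    val k = 2            // assumed number of neighbours to keep per row
    val threshold = 3.0  // assumed maximum Euclidean distance for a match

    // Self-join: each output row has struct columns datasetA and datasetB
    // plus the distance column; drop pairs that match a row with itself
    val pairs = model
      .approxSimilarityJoin(inputDF, inputDF, threshold, "distance")
      .toDF()
      .filter(col("datasetA.id") =!= col("datasetB.id"))

    // Rank candidates per left-hand row by distance and keep the closest k
    val byDistance = Window.partitionBy(col("datasetA.id")).orderBy(col("distance"))
    val neighbours = pairs
      .withColumn("rank", row_number().over(byDistance))
      .filter(col("rank") <= k)
      .groupBy(col("datasetA.id").as("id"))
      .agg(collect_list(struct(
        col("datasetB.id").as("id"),
        col("datasetB.features").as("features"),
        col("distance")
      )).as("neighbours"))

    // Left join so rows without any candidate keep a null neighbours column
    inputDF.join(neighbours, Seq("id"), "left").printSchema()

    spark.stop()
  }
}
```

The resulting `neighbours` column is an array of structs, which is close to the target schema; the order of elements inside the array is not guaranteed by `collect_list`, so sort by `distance` afterwards if order matters.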