Spark: Euclidean distance between two vector columns

Date: 2019-04-23 13:15:07

Tags: scala apache-spark

I have two DataFrames, df1 and df2, with the following structure:

print(df1)
+-------+------------+-------------+---------+
|     id|      vector|  start_time | end_time|
+-------+------------+-------------+---------+
|      1| [0,0,0,0,0]|          000|      200|
|      2| [1,1,1,1,1]|          200|      500|
|      3| [0,1,0,1,0]|          100|      500|
+-------+------------+-------------+---------+

print(df2)
+-------+------------+-------+
|     id|      vector|   time|
+-------+------------+-------+
|      A| [0,1,1,1,0]|    050|
|      B| [1,0,0,1,1]|    150|
|      C| [1,1,1,1,1]|    250|
|      D| [1,0,1,0,1]|    350|
|      E| [1,1,1,1,1]|    450|
|      F| [1,0,5,0,0]|    550|
+-------+------------+-------+

What I want: for each row of df1, take every row of df2 whose time falls between start_time and end_time, and compute the Euclidean distance between the two vectors for each such pair.

I started with the code below, but I'm stuck on how to compute the distance:

val joined_DF = kafka_DF.crossJoin(
        hdfs_DF.withColumnRenamed("id","id2").withColumnRenamed("vector","vector2")
    )
      .filter(col("time")>= col("start_time") &&
        col("time")<= col("end_time"))
        .withColumn("distance", ???) // Euclidean distance element-wise between columns vector and column vector2

Here is the expected output on the sample data:

+-------+------------+-------------+---------+-------+------------+------+----------+
|     id|      vector|  start_time | end_time|    id2|     vector2|  time| distance |
+-------+------------+-------------+---------+-------+------------+------+----------+
|      1| [0,0,0,0,0]|          000|      200|      A| [0,1,1,1,0]|   050|   1.73205|
|      1| [0,0,0,0,0]|          000|      200|      B| [1,0,0,1,1]|   150|   1.73205|
|      2| [1,1,1,1,1]|          200|      500|      C| [1,1,1,1,1]|   250|         0|
|      2| [1,1,1,1,1]|          200|      500|      D| [1,0,1,0,1]|   350|   1.41421|
|      2| [1,1,1,1,1]|          200|      500|      E| [1,1,1,1,1]|   450|         0|
|      3| [0,1,0,1,0]|          100|      500|      B| [1,0,0,1,1]|   150|   1.73205|
|      3| [0,1,0,1,0]|          100|      500|      C| [1,1,1,1,1]|   250|   1.73205|
|      3| [0,1,0,1,0]|          100|      500|      D| [1,0,1,0,1]|   350|   2.23606|
|      3| [0,1,0,1,0]|          100|      500|      E| [1,1,1,1,1]|   450|   1.73205|
+-------+------------+-------------+---------+-------+------------+------+----------+
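
For reference, the distance column is the plain Euclidean distance between the two vectors. For the first pair (id=1, id2=A), for example:

distance = sqrt((0-0)^2 + (0-1)^2 + (0-1)^2 + (0-1)^2 + (0-0)^2) = sqrt(3) ≈ 1.73205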

Notes:

  • df1 will always contain a small amount of data, so the crossJoin won't risk blowing up my memory.
  • My DataFrames are created with the Structured Streaming API.
  • I'm using Spark 2.3.2.

1 Answer:

Answer 0 (score: 4)

A udf should work in this case.

import scala.math.sqrt
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Input: two ML vectors of equal length
// Output: the Euclidean distance between them
val euclideanDistance = udf { (v1: Vector, v2: Vector) =>
    sqrt(Vectors.sqdist(v1, v2))
}
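
Note that this assumes the vector and vector2 columns are already of the ML Vector type (org.apache.spark.ml.linalg.Vector). If they are plain array columns instead (say array<double>), a minimal sketch of the same idea on Seq[Double] could look like this; the array column type is an assumption, not something stated in the question:

// Hypothetical variant for array<double> columns (assumed, not from the question):
// zip the two arrays, sum the squared element differences, take the square root.
val euclideanDistanceArr = udf { (v1: Seq[Double], v2: Seq[Double]) =>
    sqrt(v1.zip(v2).map { case (a, b) => (a - b) * (a - b) }.sum)
}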

Use the new udf like this:

joined_DF.withColumn("distance", euclideanDistance($"vector", $"vector2"))
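
Putting it all together, a minimal end-to-end sketch could look like the following. The DataFrame names kafka_DF and hdfs_DF come from the question; the spark value and the implicits import needed for the $"..." column syntax are assumed:

import org.apache.spark.sql.functions.col
import spark.implicits._  // assumes a SparkSession named spark is in scope

// Cross join the two DataFrames, keep only the rows whose time falls inside
// [start_time, end_time], then add the Euclidean distance between the vectors.
val result_DF = kafka_DF
  .crossJoin(
    hdfs_DF.withColumnRenamed("id", "id2").withColumnRenamed("vector", "vector2")
  )
  .filter(col("time") >= col("start_time") && col("time") <= col("end_time"))
  .withColumn("distance", euclideanDistance($"vector", $"vector2"))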