I have two DataFrames, df1 and df2, with the following structure:
print(df1)
+-------+------------+-------------+---------+
| id| vector| start_time | end_time|
+-------+------------+-------------+---------+
| 1| [0,0,0,0,0]| 000| 200|
| 2| [1,1,1,1,1]| 200| 500|
| 3| [0,1,0,1,0]| 100| 500|
+-------+------------+-------------+---------+
print(df2)
+-------+------------+-------+
| id| vector| time|
+-------+------------+-------+
| A| [0,1,1,1,0]| 050|
| B| [1,0,0,1,1]| 150|
| C| [1,1,1,1,1]| 250|
| D| [1,0,1,0,1]| 350|
| E| [1,1,1,1,1]| 450|
| F| [1,0,5,0,0]| 550|
+-------+------------+-------+
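What I want: for each row of df1, take all rows of df2 whose time falls between start_time and end_time, and compute the Euclidean distance between the two vectors for each of those rows.
I started with the code below, but I am stuck on the distance computation: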
val joined_DF = kafka_DF.crossJoin(
    hdfs_DF.withColumnRenamed("id", "id2").withColumnRenamed("vector", "vector2")
  )
  .filter(col("time") >= col("start_time") &&
          col("time") <= col("end_time"))
  .withColumn("distance", ???) // Euclidean distance element-wise between columns vector and column vector2
Here is the expected output on the sample data:
+-------+------------+-------------+---------+-------+------------+------+----------+
| id| vector| start_time | end_time| id2| vector2| time| distance |
+-------+------------+-------------+---------+-------+------------+------+----------+
| 1| [0,0,0,0,0]| 000| 200| A| [0,1,1,1,0]| 050| 1.73205|
| 1| [0,0,0,0,0]| 000| 200| B| [1,0,0,1,1]| 150| 1.73205|
| 2| [1,1,1,1,1]| 200| 500| C| [1,1,1,1,1]| 250| 0|
| 2| [1,1,1,1,1]| 200| 500| D| [1,0,1,0,1]| 350| 1.41421|
| 2| [1,1,1,1,1]| 200| 500| E| [1,1,1,1,1]| 450| 0|
| 3| [0,1,0,1,0]| 100| 500| B| [1,0,0,1,1]| 150| 1.73205|
| 3| [0,1,0,1,0]| 100| 500| C| [1,1,1,1,1]| 250| 1.73205|
| 3| [0,1,0,1,0]| 100| 500| D| [1,0,1,0,1]| 350| 2.23606|
| 3| [0,1,0,1,0]| 100| 500| E| [1,1,1,1,1]| 450| 1.73205|
+-------+------------+-------------+---------+-------+------------+------+----------+
Note: df1 will always hold only a small amount of data, so the crossJoin does not risk exhausting my memory.

Answer 0 (score: 4)
A udf should work in this case.
import math._
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.linalg.Vectors
// Input: two vectors of the same length n
// Output: the Euclidean distance between the vectors
val euclideanDistance = udf { (v1: Vector, v2: Vector) =>
  sqrt(Vectors.sqdist(v1, v2))
}
Use the new udf like this:
joined_DF.withColumn("distance", euclideanDistance($"vector", $"vector2"))
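The udf above expects both columns to hold Spark ML Vector values. The printed sample does not show the column type, so if the columns turn out to be plain arrays, the same distance could be computed with a small Scala function instead. This is only a minimal sketch: euclidean is a hypothetical helper name, and array<double> columns are an assumption.

```scala
// Hypothetical helper: Euclidean distance between two equal-length sequences.
// Assumes the DataFrame columns are array<double>, which Spark hands to a
// Scala udf as Seq[Double].
def euclidean(a: Seq[Double], b: Seq[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  // Sum the squared element-wise differences, then take the square root.
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
}
```

It could then be wrapped with udf(euclidean _) and applied to vector and vector2 in the same way as euclideanDistance. For id 1 against A in the sample data, it gives the square root of 3, which agrees with the 1.73205 in the expected output.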