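I want to find, for each element of a DataFrame, the closest value in a second DataFrame.
I have two DataFrames. The first (DF1) contains 14,000,000 elements. I took a sample DataFrame (DF2) containing 30,000 elements.
Now, for each element in DF1, I want to find the closest value among all elements of DF2.
For example: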
DF1:
Timestamp, Value
2014-01-01 00:00:01, 3.0
2014-01-01 00:00:05, 12.0
2014-01-01 00:00:09, 8.0
2014-01-01 00:00:10, 45.0
2014-01-01 00:00:15, 3.0
2014-01-01 00:00:21, 4.0
2014-01-01 00:00:32, 19.0
DF2:
Timestamp, Value
2014-01-01 00:00:01, 3.0
2014-01-01 00:00:10, 45.0
2014-01-01 00:00:09, 8.0
The result should look like this:
resultDF
Timestamp, Value, ClosestValue
2014-01-01 00:00:01, 3.0, 3.0
2014-01-01 00:00:05, 12.0, 8.0
2014-01-01 00:00:09, 8.0, 8.0
2014-01-01 00:00:10, 45.0, 45.0
2014-01-01 00:00:15, 3.0, 3.0
2014-01-01 00:00:21, 4.0, 3.0
2014-01-01 00:00:32, 19.0, 8.0
...
Answer 0 (score: 1)
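Since your second DataFrame is small, I suggest collecting its values and creating a broadcast variable that can be used to search for the nearest element. The next step is to implement a UDF responsible for finding the closest element. I think you can use binary search for this, so the total complexity is O(N*logM), where N is the size of DF1 and M is the size of DF2.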
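import org.apache.spark.sql.functions.udf

// Sort the values so that binary search can be used for fast lookups.
// The Value column is selected explicitly because column 0 of DF2 is the
// Timestamp in the example; Value is assumed to be of type Double.
val values = df2.select("Value").collect().map(r => r.getDouble(0)).sorted
val valuesBroadcast = session.sparkContext.broadcast(values)

// Returns the entry of the sorted array `values` that is closest to `element`,
// or NaN if the array is empty.
def findClosest(element: Double, values: Array[Double]): Double = {
  var left = 0
  var right = values.length - 1
  var closest = Double.NaN
  var min = Double.MaxValue
  while (left <= right) {
    val mid = (left + right) / 2
    val current = values(mid)
    if (current == element) {
      // Exact match: nothing can be closer, so force the loop to end.
      closest = element
      left = right + 1
    } else {
      // Narrow the search range towards `element`.
      if (current < element) {
        left = mid + 1
      } else {
        right = mid - 1
      }
      // Remember the nearest value seen so far along the search path.
      val distance = (current - element).abs
      if (distance < min) {
        min = distance
        closest = current
      }
    }
  }
  closest
}

// Each executor reads the sorted array from the broadcast variable.
val findClosestUdf = udf((element: Double) => findClosest(element, valuesBroadcast.value))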
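
To show how the pieces fit together, here is a minimal self-contained sketch that runs the same broadcast-plus-UDF approach on the sample data from the question. It rests on several assumptions: a local SparkSession, a Double column named Value, and a non-empty DF2. The names ClosestValueExample, closestInSorted and closestUdf are made up for this illustration. The lookup here uses java.util.Arrays.binarySearch instead of the hand-written loop above and then compares the two neighbours of the insertion point.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object ClosestValueExample {

  // Closest entry of a sorted, non-empty array to x.
  // On a tie, the smaller of the two neighbours is returned.
  def closestInSorted(sorted: Array[Double], x: Double): Double = {
    val idx = java.util.Arrays.binarySearch(sorted, x)
    if (idx >= 0) {
      sorted(idx) // exact match
    } else {
      // binarySearch returns -(insertionPoint) - 1 when x is absent.
      val ins = -idx - 1
      if (ins == 0) sorted(0)                                  // x is below every entry
      else if (ins == sorted.length) sorted(sorted.length - 1) // x is above every entry
      else {
        val lo = sorted(ins - 1)
        val hi = sorted(ins)
        if (x - lo <= hi - x) lo else hi
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("closest-value")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question (timestamps kept as strings for brevity).
    val df1 = Seq(
      ("2014-01-01 00:00:01", 3.0),
      ("2014-01-01 00:00:05", 12.0),
      ("2014-01-01 00:00:09", 8.0),
      ("2014-01-01 00:00:10", 45.0),
      ("2014-01-01 00:00:15", 3.0),
      ("2014-01-01 00:00:21", 4.0),
      ("2014-01-01 00:00:32", 19.0)
    ).toDF("Timestamp", "Value")

    val df2 = Seq(
      ("2014-01-01 00:00:01", 3.0),
      ("2014-01-01 00:00:10", 45.0),
      ("2014-01-01 00:00:09", 8.0)
    ).toDF("Timestamp", "Value")

    // Collect the small DataFrame, sort it once, and broadcast it.
    val sorted = df2.select("Value").as[Double].collect().sorted
    val sortedBc = spark.sparkContext.broadcast(sorted)

    val closestUdf = udf((x: Double) => closestInSorted(sortedBc.value, x))

    // Add the ClosestValue column to every row of DF1.
    val resultDF = df1.withColumn("ClosestValue", closestUdf(col("Value")))
    resultDF.show(false)

    spark.stop()
  }
}
```

On the sample data this should reproduce the ClosestValue column of resultDF from the question. If DF2 could be empty, or Value could contain nulls, both cases would need to be handled before calling the UDF.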