How to find the closest value in Spark

Time: 2017-06-27 14:14:58

Tags: scala apache-spark dataframe

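For each element of one DataFrame, I want to find the closest value in a second DataFrame.

I have two DataFrames. The first one (DF1) contains 14,000,000 elements. I took a sample DataFrame (DF2) containing 30,000 elements.

Now, for each element in DF1, I want to find the closest value among all elements of DF2.

For example: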

DF1:

Timestamp,                 Value
2014-01-01 00:00:01,       3.0
2014-01-01 00:00:05,       12.0
2014-01-01 00:00:09,       8.0
2014-01-01 00:00:10,       45.0
2014-01-01 00:00:15,       3.0
2014-01-01 00:00:21,       4.0
2014-01-01 00:00:32,       19.0

DF2:

Timestamp,                 Value
2014-01-01 00:00:01,       3.0
2014-01-01 00:00:10,       45.0
2014-01-01 00:00:09,       8.0

The result should look like this:

resultDF

Timestamp,                 Value,     ClosestValue
2014-01-01 00:00:01,       3.0,       3.0
2014-01-01 00:00:05,       12.0,      8.0
2014-01-01 00:00:09,       8.0,       8.0
2014-01-01 00:00:10,       45.0,      45.0
2014-01-01 00:00:15,       3.0,       3.0
2014-01-01 00:00:21,       4.0,       3.0
2014-01-01 00:00:32,       19.0,      8.0
...
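
Put differently, "closest" is read here as the DF2 value with the smallest absolute difference from the DF1 value. That reading is an assumption, though it matches every example row above. A brute-force reference in plain Scala makes it concrete; the object and method names are hypothetical, and scanning all of DF2 for every DF1 row like this would be far too slow at the real sizes.

```scala
object ClosestValueReference {
  // Brute force: scan every candidate and keep the one with the smallest
  // absolute difference. Throws on an empty candidate list (minBy on empty).
  def closest(x: Double, candidates: Seq[Double]): Double =
    candidates.minBy(c => math.abs(c - x))

  // Value columns copied from the DF1 and DF2 examples above.
  val df1Values: Seq[Double] = Seq(3.0, 12.0, 8.0, 45.0, 3.0, 4.0, 19.0)
  val df2Values: Seq[Double] = Seq(3.0, 45.0, 8.0)

  // One closest value per DF1 row, in the same order as df1Values.
  val closestValues: Seq[Double] = df1Values.map(closest(_, df2Values))

  def main(args: Array[String]): Unit =
    df1Values.zip(closestValues).foreach { case (v, c) => println(s"$v -> $c") }
}
```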

1 answer:

Answer 0 (score: 1)

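Since your second DataFrame is small, I suggest collecting its values and creating a broadcast variable that can be used to search for the nearest element. The next step is to implement a UDF responsible for finding the closest element. I think you can use binary search for this, so the total complexity is O(N*logM), where N is the size of DF1 and M is the size of DF2.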
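
To get a feel for the O(N*logM) bound, the arithmetic can be worked through with the sizes from the question (N = 14,000,000 and M = 30,000). This is only a rough sketch: the object and value names are made up for illustration, and step counts are a proxy for work, not a runtime prediction.

```scala
object ComplexityEstimate {
  val n: Long = 14000000L // rows in DF1 (from the question)
  val m: Long = 30000L    // rows in DF2 (from the question)

  // Worst-case loop iterations of a binary search over m sorted values:
  // floor(log2(m)) + 1
  val stepsPerLookup: Int = (math.log(m.toDouble) / math.log(2.0)).toInt + 1

  // One binary search per DF1 row versus one full scan of DF2 per DF1 row.
  val binarySearchSteps: Long = n * stepsPerLookup
  val bruteForceSteps: Long = n * m

  def main(args: Array[String]): Unit = {
    println(s"steps per lookup: $stepsPerLookup")
    println(s"binary search:    $binarySearchSteps")
    println(s"brute force:      $bruteForceSteps")
  }
}
```

With these sizes each lookup takes at most 15 iterations, so the broadcast-plus-binary-search approach needs on the order of 2.1 * 10^8 steps, compared with 4.2 * 10^11 comparisons for a full scan per row.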

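Step 1 - create the broadcast variable

// we need to sort values to enable fast searching using binary search
// select the Value column explicitly: in the example schema column 0 is Timestamp
// session is the SparkSession
val values = df2.select("Value").collect().map(r => r.getDouble(0)).sorted
val valuesBroadcast = session.sparkContext.broadcast(values)

Step 2 - implement binary search

def findClosest(element: Double, values: Array[Double]): Double = {
  var left = 0
  var right = values.length - 1
  var closest = Double.NaN    // returned unchanged if values is empty
  var min = Double.MaxValue   // smallest distance seen so far
  while(left <= right) {
    val mid = (left + right) / 2
    val current = values(mid)
    if(current == element) {
      // exact match: nothing can be closer, so force the loop to end
      closest = element
      left = right + 1
    }
    else {
      if(current < element) {
        left = mid + 1
      }
      else {
        right = mid - 1
      }
      // remember the nearest value visited along the search path
      val distance = (current - element).abs
      if(distance < min) {
        min = distance
        closest = current
      }
    }
  }
  closest
}

Step 3 - create the UDF

import org.apache.spark.sql.functions.udf

val findClosestUdf = udf((element: Double) => findClosest(element, valuesBroadcast.value))

Step 4 - use the UDF

Apply findClosestUdf to the Value column of DF1 to produce the ClosestValue column.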
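
A minimal end-to-end sketch of applying the UDF, under several assumptions: the DataFrames are rebuilt from the example rows in the question, timestamps are kept as plain strings, Spark runs in local mode, and the column names Timestamp, Value and ClosestValue follow the question. To keep the block short and self-contained, `nearest` is a compact stand-in built on `java.util.Arrays.binarySearch`, not the `findClosest` function from step 2; the object name is hypothetical as well.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object ClosestValueSketch {
  // Stand-in for findClosest: locate the insertion point in the sorted
  // array, then compare the two neighbours around it.
  def nearest(sorted: Array[Double], x: Double): Double = {
    require(sorted.nonEmpty, "need at least one candidate value")
    val i = java.util.Arrays.binarySearch(sorted, x)
    if (i >= 0) sorted(i) // exact match
    else {
      val ins = -i - 1 // index of the first element greater than x
      if (ins == 0) sorted(0)
      else if (ins == sorted.length) sorted(sorted.length - 1)
      else {
        val lo = sorted(ins - 1)
        val hi = sorted(ins)
        if (x - lo <= hi - x) lo else hi // ties go to the smaller value
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("closest-value-sketch")
      .getOrCreate()
    import session.implicits._

    // Example rows from the question.
    val df1 = Seq(
      ("2014-01-01 00:00:01", 3.0), ("2014-01-01 00:00:05", 12.0),
      ("2014-01-01 00:00:09", 8.0), ("2014-01-01 00:00:10", 45.0),
      ("2014-01-01 00:00:15", 3.0), ("2014-01-01 00:00:21", 4.0),
      ("2014-01-01 00:00:32", 19.0)
    ).toDF("Timestamp", "Value")
    val df2 = Seq(
      ("2014-01-01 00:00:01", 3.0), ("2014-01-01 00:00:10", 45.0),
      ("2014-01-01 00:00:09", 8.0)
    ).toDF("Timestamp", "Value")

    // Steps 1 and 3: sorted values, broadcast, UDF.
    val values = df2.select("Value").collect().map(_.getDouble(0)).sorted
    val valuesBroadcast = session.sparkContext.broadcast(values)
    val findClosestUdf = udf((element: Double) => nearest(valuesBroadcast.value, element))

    // Step 4: add the ClosestValue column to DF1.
    val resultDF = df1.withColumn("ClosestValue", findClosestUdf(col("Value")))
    resultDF.show(false)

    session.stop()
  }
}
```

On the sample rows this is expected to reproduce the ClosestValue column from the question. Because the array is broadcast once, each executor searches its own local copy and no join or shuffle against DF2 is needed.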