Question

I have a pair RDD containing (Key, (Timestamp, Value)) entries.

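When the data is read, the entries are sorted by timestamp, so each partition of the RDD should be ordered by timestamp. What I want to do is find, for every key, the largest gap between two consecutive timestamps.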

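I have been thinking about this for a long time and I don't know whether it can be done with the functions Spark provides. The problems I see are: when I do a simple map, I lose the ordering information, so that is not an option. groupByKey also seems to fail, because there are too many entries for a specific key; trying it gives me java.io.IOException: No space left on device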

Any help on how to approach this would be much appreciated.

Answer 1

As suggested by The Archetypal Paul, you can use a DataFrame and window functions. First, the required imports:

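import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
// toDF and the $"..." column syntax used below also need the session implicits,
// e.g. import spark.implicits._ (assuming a SparkSession named spark)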

Next, the data has to be converted to a DataFrame:

val df = rdd.mapValues(_._1).toDF("key", "timestamp")

To be able to use the lag function, we need a window definition:

val keyTimestampWindow = Window.partitionBy("key").orderBy("timestamp")

which can then be used to compute the gap to the previous timestamp:

val withGap = df.withColumn(
  "gap", $"timestamp" - lag("timestamp", 1).over(keyTimestampWindow)
)

Finally, groupBy with max:

withGap.groupBy("key").max("gap")
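
To make the window logic above concrete, here is a minimal sketch on plain Scala collections rather than Spark. The object name, the sample rows, and the Option result type are assumptions for illustration only. It mirrors the steps: partition by key, order by timestamp, subtract the previous timestamp (the lag), then take the maximum per key. A key with a single timestamp has no gap, which is modelled here as None.

```scala
// Plain-collections sketch of "lag over a window, then max per key".
// LagSketch and its sample data are hypothetical, not part of the answer.
object LagSketch {
  val rows: Seq[(String, Long)] =
    Seq(("a", 1L), ("a", 4L), ("a", 9L), ("b", 10L), ("b", 12L), ("c", 7L))

  def maxGaps(data: Seq[(String, Long)]): Map[String, Option[Long]] =
    data.groupBy(_._1).map { case (key, group) =>
      // Order the timestamps of this key, as orderBy("timestamp") would.
      val ts = group.map(_._2).sorted
      // Pair each timestamp with its predecessor and take the difference.
      val gaps = ts.zip(ts.drop(1)).map { case (prev, cur) => cur - prev }
      // A key with fewer than two timestamps has no gap at all.
      key -> gaps.reduceOption(_ max _)
    }
}
```

Calling LagSketch.maxGaps(LagSketch.rows) gives one entry per key, with None for key "c", which has only one timestamp.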

Following the second suggestion by The Archetypal Paul, you can sort by key and timestamp:

val sorted = rdd.mapValues(_._1).sortBy(identity)

With the data arranged this way, you can find the maximum gap for each key by sliding over consecutive pairs and reducing by key:

import org.apache.spark.mllib.rdd.RDDFunctions._

sorted.sliding(2).collect {
  case Array((key1, val1), (key2, val2)) if key1 == key2 => (key1, val2 - val1)
}.reduceByKey(Math.max(_, _))
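
The sliding step can be tried out on an ordinary Scala sequence, which may help to see why windows that span two different keys are dropped. This is only a sketch: SlidingSketch and the sample data are made up, and groupBy followed by max stands in for reduceByKey.

```scala
// Plain-collections sketch of sliding(2) plus a per-key maximum.
// SlidingSketch and its sample data are hypothetical.
object SlidingSketch {
  // Already ordered by (key, timestamp), as sortBy(identity) would leave it.
  val sorted: Seq[(String, Long)] =
    Seq(("a", 1L), ("a", 4L), ("a", 9L), ("b", 10L), ("b", 12L))

  def maxGaps(data: Seq[(String, Long)]): Map[String, Long] =
    data.sliding(2).collect {
      // Keep only windows whose two records share a key; the window
      // (("a", 9), ("b", 10)) is skipped by the guard.
      case Seq((k1, t1), (k2, t2)) if k1 == k2 => (k1, t2 - t1)
    }.toSeq
      .groupBy(_._1)
      .map { case (k, gaps) => k -> gaps.map(_._2).max }
}
```

Keys with a single record produce no window with a matching neighbour, so they do not appear in the result at all.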

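Another variant of the same idea is to repartition and sort first. Note that repartitionAndSortWithinPartitions orders records by key only, so this relies on the timestamps of each key still arriving in order: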

val partitionedAndSorted = rdd
  .mapValues(_._1)
  .repartitionAndSortWithinPartitions(
    new org.apache.spark.HashPartitioner(rdd.partitions.size)
  )

Data prepared this way can then be transformed:

val lagged = partitionedAndSorted.mapPartitions(_.sliding(2).collect {
  case Seq((key1, val1), (key2, val2)) if key1 == key2 => (key1, val2 - val1)
}, preservesPartitioning=true)

followed by reduceByKey:

lagged.reduceByKey(Math.max(_, _))
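
The partition-local variant can be sketched without Spark as follows. Everything here is an assumption for illustration: PartitionSketch, partitionOf (a stand-in for a hash partitioner), and the sample records. The sketch explicitly sorts each partition by (key, timestamp), which is the ordering the sliding step depends on. Because all records of a key land in the same partition, no window ever needs to cross a partition boundary.

```scala
// Plain-collections sketch of "hash-partition, sort inside each partition,
// slide inside each partition, reduce by key". All names are hypothetical.
object PartitionSketch {
  val records: Seq[(String, Long)] =
    Seq(("b", 12L), ("a", 4L), ("a", 1L), ("b", 10L), ("a", 9L), ("c", 3L))

  // Stand-in for a hash partitioner: non-negative hash modulo partition count.
  def partitionOf(key: String, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  def maxGaps(data: Seq[(String, Long)], numPartitions: Int): Map[String, Long] = {
    // Assumption: every partition ends up ordered by (key, timestamp).
    val partitions: Seq[Seq[(String, Long)]] =
      data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
        .values.toSeq.map(_.sorted)

    // Slide inside each partition only, like mapPartitions(_.sliding(2)...).
    val lagged: Seq[(String, Long)] = partitions.flatMap { part =>
      part.sliding(2).collect {
        case Seq((k1, t1), (k2, t2)) if k1 == k2 => (k1, t2 - t1)
      }.toSeq
    }

    // Per-key maximum, standing in for reduceByKey(Math.max(_, _)).
    lagged.groupBy(_._1).map { case (k, gaps) => k -> gaps.map(_._2).max }
  }
}
```

The result should not depend on the number of partitions chosen, since the grouping by key is what keeps consecutive timestamps together.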

Spark: finding gaps in timestamps
