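Efficiently finding overlapping JodaTime Intervals

Time: 2016-10-29 23:25:29

Tags: scala apache-spark

I have a Seq of intervals and I want to collapse the overlapping ones into each other. I have written this: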

import org.joda.time.{DateTime, Interval}

val today = DateTime.now()
val a: Seq[Interval] = Seq(
  new Interval(today.minusDays(11), today.minusDays(10)), //1 day interval ending 10 days ago
  new Interval(today.minusDays(10), today.minusDays(8)), //2 day interval ending 8 days ago, overlaps with above
  new Interval(today.minusDays(7), today.minusDays(5)), //2 day interval ending 5 days ago, DOES NOT OVERLAP
  new Interval(today.minusDays(4), today.minusDays(1)), //3 day interval ending 1 day ago, DOES NOT OVERLAP above
  new Interval(today.minusDays(4), today) //4 day interval ending today, overlaps with above
)
val actual = a.sortBy(_.getStartMillis).foldLeft(Seq[Interval]())((vals, e) => {
  if (vals.isEmpty) {
    vals ++ Seq(e)
  } else {
    val fst = vals.last
    val snd = e
    if (snd.getStart.getMillis <= fst.getEnd.getMillis) /*they overlap*/ {
      vals.dropRight(1) ++ Seq(new Interval(fst.getStart, snd.getEnd)) //combine both into one interval
    } else {
      vals ++ Seq(e)
    }
  }
})
val expected = Seq(
  new Interval(today.minusDays(11), today.minusDays(8)),
  new Interval(today.minusDays(7), today.minusDays(5)),
  new Interval(today.minusDays(4), today)
)
println(
  s"""
     |Expected: $expected
     |Actual  : $actual
   """.stripMargin)
assert(expected == actual)

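Which works. My initial concern is the line vals.dropRight(1) ++ Seq(new Interval(fst.getStart, snd.getEnd))

I suspect dropRight is O(n - m), where n = |vals| and m = 1 in this case.

This implementation becomes expensive if |a| is on the order of hundreds of thousands or more. In fact, vals ++ Seq(e) is also a problem if it costs n + 1 for each a[i].

First, is my assessment correct?

Second, is there a more efficient way to write this without mutable data structures?

I have written this outside the context it is used in; the actual application is in a Spark job (so that foldLeft would be folding over an RDD[MyType]).

EDIT: Completely forgot that there is no foldLeft on an RDD (ignore Spark, I will have to think of another way, but I'm still interested in an answer to this, minus the fact that it will work in Spark).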

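1 Answer:

Answer 0: (score: 4)

Take a look at the IntervalSeq[A] and IntervalTrie[A] data structures in the spire math library. IntervalTrie[A] allows performing boolean operations such as union and intersection on sets of non-overlapping intervals with extremely high performance. It requires the element type to be losslessly convertible to a Long, which is the case for joda DateTime.

Here is how you would solve this problem using spire: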

First, make sure you have the right dependency. Add spire-extras to your build.sbt:

libraryDependencies += "org.spire-math" %% "spire-extras" % "0.12.0"

Next, you need to define an IntervalTrie.Element typeclass instance for org.joda.time.DateTime:

// Imports assumed for spire 0.12.0; package locations may differ in other versions
import org.joda.time.DateTime
import spire.algebra.Order
import spire.implicits._ // provides Order[Long], needed by Order.by
import spire.math.extras.interval.IntervalTrie

implicit val jodaDateTimeIsIntervalTrieElement: IntervalTrie.Element[DateTime] = new IntervalTrie.Element[DateTime] {
  override implicit val order: Order[DateTime] = Order.by(_.getMillis)
  override def toLong(value: DateTime): Long = value.getMillis
  override def fromLong(key: Long): DateTime = new DateTime(key)
}

Now you can use IntervalTrie to perform boolean operations on DateTime intervals (note that Interval here refers to the generic interval type spire.math.Interval, not the joda Interval):

// import the Order[DateTime] instance (for spire.math.Interval creation)
import jodaDateTimeIsIntervalTrieElement.order

// create 100000 random Interval[DateTime]
val r = new scala.util.Random()
val n = 100000
val intervals = (0 until n).map { _ =>
  val ticks = r.nextInt(1000000000) * 2000L
  val t0 = new DateTime(ticks)
  val t1 = new DateTime(ticks + r.nextInt(1000000000))
  // wrap each single interval in an IntervalTrie so they can be combined with |
  IntervalTrie(spire.math.Interval(t0, t1))
}

// compute the union of all of them using IntervalTrie
val t0 = System.nanoTime()
val union = intervals.foldLeft(IntervalTrie.empty[DateTime])(_ | _)
val dt = (System.nanoTime() - t0) / 1.0e9
println(s"Union of $n random intervals took $dt seconds!")

This is very fast on my machine.

Doing a proper benchmark with warmup would make this even faster.
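
On the complexity question itself: Seq(...) in the standard library is backed by an immutable List by default, and on a List the calls last, dropRight(1) and ++ each walk the whole accumulator, so the fold in the question is likely quadratic in the worst case. Below is a minimal sketch of a fold that stays immutable but prepends to a List and reverses once at the end. Span and MergeSpans are hypothetical names used only for illustration; Span stands in for a Joda Interval by holding start and end as epoch millis, so the example has no dependencies. It also takes the larger of the two end points when merging, which covers an interval nested inside the previous one.

```scala
// Hypothetical stand-in for a Joda Interval: start and end in epoch millis.
final case class Span(start: Long, end: Long)

object MergeSpans {
  /** Merges overlapping or touching spans; the result is sorted by start. */
  def merge(spans: Seq[Span]): List[Span] =
    spans
      .sortBy(_.start)
      .foldLeft(List.empty[Span]) {
        // The head of the accumulator is the most recently merged span.
        case (last :: rest, s) if s.start <= last.end =>
          // Overlap: replace the head in O(1). max() handles a span nested in `last`.
          Span(last.start, math.max(last.end, s.end)) :: rest
        case (acc, s) =>
          // Empty accumulator or no overlap: O(1) prepend.
          s :: acc
      }
      .reverse // a single O(n) pass restores ascending order
}
```

With the sort included, this sketch runs in O(n log n). A Joda Interval can be mapped to such a pair with getStartMillis and getEndMillis, and back with the Interval constructor that takes two millisecond values.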