I have a Seq of intervals, and I want to collapse the overlapping ones into each other.
I've written this:
import org.joda.time.{DateTime, Interval}

val today = DateTime.now()
val a: Seq[Interval] = Seq(
  new Interval(today.minusDays(11), today.minusDays(10)), // 1 day interval ending 10 days ago
  new Interval(today.minusDays(10), today.minusDays(8)),  // 2 day interval ending 8 days ago, overlaps with above
  new Interval(today.minusDays(7), today.minusDays(5)),   // 2 day interval ending 5 days ago, DOES NOT OVERLAP
  new Interval(today.minusDays(4), today.minusDays(1)),   // 3 day interval ending 1 day ago, DOES NOT OVERLAP above
  new Interval(today.minusDays(4), today)                 // 4 day interval ending today, overlaps with above
)
val actual = a.sortBy(_.getStartMillis).foldLeft(Seq[Interval]())((vals, e) => {
  if (vals.isEmpty) {
    vals ++ Seq(e)
  } else {
    val fst = vals.last
    val snd = e
    if (snd.getStart.getMillis <= fst.getEnd.getMillis) /*they overlap*/ {
      vals.dropRight(1) ++ Seq(new Interval(fst.getStart, snd.getEnd)) //combine both into one interval
    } else {
      vals ++ Seq(e)
    }
  }
})
val expected = Seq(
  new Interval(today.minusDays(11), today.minusDays(8)),
  new Interval(today.minusDays(7), today.minusDays(5)),
  new Interval(today.minusDays(4), today)
)
println(
  s"""
     |Expected: $expected
     |Actual  : $actual
   """.stripMargin)
assert(expected == actual)
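This works. My initial concern was with this line:
vals.dropRight(1) ++ Seq(new Interval(fst.getStart, snd.getEnd))
I suspect dropRight is O(n - m), where in this case n = |vals| and m = 1.
This implementation becomes expensive if |a| is on the order of hundreds of thousands or more. In fact, vals ++ Seq(e) is also a problem if it costs n + 1 for each a[i].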
First, is my assessment correct?
Second, is there a more efficient way to write this without mutable data structures?
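I've written this outside the context it is used in; the actual application is a Spark job (so the foldLeft would be folding over an RDD[MyType]).
EDIT: I completely forgot that there is no foldLeft on an RDD. Ignore Spark; I'll have to think of another way there, but I'm still interested in an answer to this, minus the fact that it has to work in Spark.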
Answer 0: (score: 4)
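Take a look at the IntervalSeq[A] and IntervalTrie[A] data structures in the spire math library. IntervalTrie[A] supports boolean operations such as union and intersection on sets of non-overlapping intervals with extremely high performance. It requires that the element type can be converted to a Long without loss, which is the case for Joda DateTime.
Here is how you would solve this problem using spire: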
First, make sure you have the right dependencies. Add a dependency on spire-extras to your build.sbt:
libraryDependencies += "org.spire-math" %% "spire-extras" % "0.12.0"
Next, you need to define an IntervalTrie.Element type class instance for org.joda.time.DateTime:
// Imports this snippet appears to need (package names assumed for spire-extras 0.12.0)
import org.joda.time.DateTime
import spire.algebra.Order
import spire.implicits._ // provides the Order[Long] used by Order.by
import spire.math.extras.interval.IntervalTrie

implicit val jodaDateTimeIsIntervalTrieElement: IntervalTrie.Element[DateTime] = new IntervalTrie.Element[DateTime] {
  override implicit val order: Order[DateTime] = Order.by(_.getMillis)
  override def toLong(value: DateTime): Long = value.getMillis
  override def fromLong(key: Long): DateTime = new DateTime(key)
}
Now you can use IntervalTrie to perform boolean operations on DateTime intervals (note that Interval here refers to the generic interval type spire.math.Interval, not the Joda Interval):
// import the Order[DateTime] instance (for spire.math.Interval creation)
import jodaDateTimeIsIntervalTrieElement.order
// create 100000 random Interval[DateTime]
val r = new scala.util.Random()
val n = 100000
val intervals = (0 until n).map { _ =>
  val ticks = r.nextInt(1000000000) * 2000L
  val t0 = new DateTime(ticks)
  val t1 = new DateTime(ticks + r.nextInt(1000000000))
  IntervalTrie(spire.math.Interval(t0, t1))
}
// compute the union of all of them using IntervalTrie
val t0 = System.nanoTime()
val union = intervals.foldLeft(IntervalTrie.empty[DateTime])(_ | _)
val dt = (System.nanoTime() - t0) / 1.0e9
println(s"Union of $n random intervals took $dt seconds!")
This is very fast on my machine.
A proper benchmark with warm-up would make this even faster.
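On the complexity question itself: Seq[Interval]() builds a List by default in Scala 2, and both dropRight(1) and ++ on a List copy the whole prefix, so the fold in the question is quadratic in the worst case. If pulling in spire is not an option, a minimal sketch of the same fold that stays immutable is to prepend to a List and reverse once at the end. The Span case class and the MergeSpans object below are hypothetical stand-ins (plain Long bounds instead of Joda's Interval) so that the example runs without dependencies. The sketch also keeps the later of the two end points; the fold in the question takes snd.getEnd unconditionally, which would shrink the result when one interval lies entirely inside the previous one.

```scala
// Hypothetical stand-in for a Joda Interval: start and end as epoch millis.
final case class Span(start: Long, end: Long)

object MergeSpans {
  /** Merges overlapping or touching spans; the result is sorted by start. */
  def merge(spans: Seq[Span]): List[Span] =
    spans.sortBy(_.start).foldLeft(List.empty[Span]) {
      // The head of the accumulator is the most recently merged span.
      case (last :: rest, next) if next.start <= last.end =>
        // Keep the later end so a span nested inside `last` cannot shrink it.
        Span(last.start, math.max(last.end, next.end)) :: rest
      case (acc, next) =>
        next :: acc // O(1) prepend instead of an O(n) append
    }.reverse // a single O(n) pass restores ascending order
}
```

With this shape the fold does constant work per element, so the total cost is dominated by the O(n log n) sort. Using a Vector accumulator with init and :+ should also avoid the quadratic behavior, since those operations are effectively constant time on Vector.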