My scenario is easier to explain with an example. Say I have the following data:
Type Time
A 1
B 3
A 5
B 9
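I want to add an extra column to each row that holds the minimum absolute difference between the times of all rows of the same type. For the first row, the minimum difference between all times of type A is 4, so rows 1 and 3 get the value 4, and likewise rows 2 and 4 get 6.
I am doing this in Spark and Spark SQL, so guidance there would be most useful, but an explanation in plain SQL would also be a great help.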
Answer 0: (score: 1)
Tested on SQL Server 2008.
create table d( type varchar(25), time int )
insert into d
values ('A',1),
('B',3),
('A',5),
('B',9)
-- Solution one: correlated subqueries; may be slow if the dataset is large.
-- Note: MAX - MIN equals the minimum difference only when each type has exactly two rows.
select *
, (select max(time) m from d as i where i.type = o.type) - (select MIN(time) m from d as i where i.type = o.type) dif
from d as o
-- Solution two: join against a per-type aggregate
select d.*, diftable.dif from d inner join
(select type, MAX(time) - MIN(time) dif
from d group by type ) as diftable on d.type = diftable.type
Answer 1: (score: 1)
One possible approach is to use window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, min, abs}
val df = Seq(
("A", -10), ("A", 1), ("A", 5), ("B", 3), ("B", 9)
).toDF("type", "time")
First, let's determine the differences between consecutive rows, sorted by time:
// Partition by type and sort by time
val w1 = Window.partitionBy($"Type").orderBy($"Time")
// Difference between this and previous
val diff = $"time" - lag($"time", 1).over(w1)
Then find the minimum of all the differences for a given type:
// Partition by type, unordered, and take an unbounded window
val w2 = Window.partitionBy($"Type").rowsBetween(Long.MinValue, Long.MaxValue)
// Minimum difference over type
val minDiff = min(diff).over(w2)
df.withColumn("min_diff", minDiff).show
// +----+----+--------+
// |type|time|min_diff|
// +----+----+--------+
// | A| -10| 4|
// | A| 1| 4|
// | A| 5| 4|
// | B| 3| 6|
// | B| 9| 6|
// +----+----+--------+
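To sanity-check the table above without a Spark session, the same two steps (neighbour differences in time order, then the minimum per type) can be mimicked with plain Scala collections. This is only a sketch under that assumption: `MinDiffSketch` and `minDiffByType` are hypothetical names, not part of Spark, and a type with a single row yields `None` here.

```scala
// Plain-Scala sketch (no Spark): mirrors the lag + min window logic above.
// `MinDiffSketch` / `minDiffByType` are hypothetical names, not a Spark API.
object MinDiffSketch {
  def minDiffByType(rows: Seq[(String, Int)]): Map[String, Option[Int]] =
    rows.groupBy(_._1).map { case (tpe, group) =>
      // Step 1: sort by time and take differences between neighbours (like lag)
      val times = group.map(_._2).sorted
      val diffs = times.zip(times.drop(1)).map { case (prev, cur) => cur - prev }
      // Step 2: minimum over the whole group; None when the type has a single row
      tpe -> diffs.reduceOption(_ min _)
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("A", -10), ("A", 1), ("A", 5), ("B", 3), ("B", 9))
    // Type A maps to Some(4) and type B to Some(6), matching min_diff above
    println(minDiffByType(rows))
  }
}
```

Sorting first is what makes this enough: the closest pair of values is always adjacent in sorted order, so only neighbouring differences need to be compared.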
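If your goal is instead to find the minimum distance between the current row and any other row in its group, you can use a similar approach: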
import org.apache.spark.sql.functions.{lead, when}
// Diff to previous
val diff_lag = $"time" - lag($"time", 1).over(w1)
// Diff to next
val diff_lead = lead($"time", 1).over(w1) - $"time"
val diffToClosest = when(
diff_lag < diff_lead || diff_lead.isNull,
diff_lag
).otherwise(diff_lead)
df.withColumn("diff_to_closest", diffToClosest).show
// +----+----+---------------+
// |type|time|diff_to_closest|
// +----+----+---------------+
// | A| -10| 11|
// | A| 1| 4|
// | A| 5| 4|
// | B| 3| 6|
// | B| 9| 6|
// +----+----+---------------+
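The lag/lead comparison above can also be traced by hand with ordinary collections. The following is a rough sketch, not Spark code: `ClosestSketch` and `diffToClosest` are hypothetical names, and a missing neighbour is modelled as `None` where the DataFrame version has a null.

```scala
// Plain-Scala sketch (no Spark) of "distance to the closest row of the same type".
// `ClosestSketch` / `diffToClosest` are hypothetical names, not a Spark API.
object ClosestSketch {
  // Returns (type, time, distance to the nearest other time of the same type)
  def diffToClosest(rows: Seq[(String, Int)]): Seq[(String, Int, Option[Int])] =
    rows.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (tpe, group) =>
      val times = group.map(_._2).sorted
      times.indices.map { i =>
        // Difference to the previous time (like lag); None for the first row
        val toPrev = if (i > 0) Some(times(i) - times(i - 1)) else None
        // Difference to the next time (like lead); None for the last row
        val toNext = if (i < times.size - 1) Some(times(i + 1) - times(i)) else None
        // Smaller of the two; None only when the type has a single row
        (tpe, times(i), (toPrev.toList ++ toNext.toList).reduceOption(_ min _))
      }
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq(("A", -10), ("A", 1), ("A", 5), ("B", 3), ("B", 9))
    diffToClosest(rows).foreach(println)
  }
}
```

For the sample data this gives 11, 4, 4 for type A and 6, 6 for type B, the same values as the diff_to_closest column.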
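Answer 2: (score: 0)
You could try something like this:
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._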
val input = sc.parallelize(Seq(
("A", 1),
("B", 3),
("A", 5),
("B", 9)
))
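val df = input.groupByKey().flatMap { case (key, values) =>
  val sorted = values.toList.sorted
  val smallestDiff = sorted match {
    // Only one record for some `Type`: fall back to the value itself
    case singleVal :: Nil => singleVal
    // Smallest gap between neighbours in sorted order, not just the two smallest values
    // (groupByKey never yields an empty group, so sorted.tail is safe here)
    case _ => sorted.zip(sorted.tail).map { case (a, b) => b - a }.min
  }
  values.map { value =>
    (key, value, smallestDiff)
  }
}.toDF("Type", "Time", "SmallestDiff")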
df.show()
Output:
+----+----+------------+
|Type|Time|SmallestDiff|
+----+----+------------+
| A| 1| 4|
| A| 5| 4|
| B| 3| 6|
| B| 9| 6|
+----+----+------------+