Question

My scenario is easier to explain with an example. Say I have the following data:

Type  Time
A     1
B     3
A     5
B     9

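I want to add an extra column to each row that holds the minimum absolute difference between all rows of the same type. For the first row, the minimum difference between all times of type A is 4, so the value would be 4 for rows 1 and 3 and, likewise, 6 for rows 2 and 4.

I am doing this in Spark and Spark SQL, so guidance there would be most useful, but an explanation in plain SQL would also be a great help.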

Answer 1

Tested in SQL Server 2008:

create table d (type varchar(25), time int)

insert into d
values ('A',1),
('B',3),
('A',5),
('B',9)

-- Solution one: correlated subqueries in the select list; may be slow on a large dataset.
-- Note: MAX - MIN equals the minimum difference only when each type has exactly two rows.
select *
, (select max(time) from d as i where i.type = o.type) - (select min(time) from d as i where i.type = o.type) as dif
from d as o

-- Solution two: aggregate once per type, then join back.
select d.*, diftable.dif
from d
inner join (
  select type, max(time) - min(time) as dif
  from d
  group by type
) as diftable on d.type = diftable.type
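
To make the caveat about MAX - MIN concrete, here is a minimal plain-Scala sketch (no Spark involved) that compares the spread of each type with a brute-force minimum over all pairs. The object and method names (`SpreadVsMinGap`, `spread`, `minPairDiff`) are made up for illustration and are not part of any library.

```scala
object SpreadVsMinGap {
  // What MAX(time) - MIN(time) yields per type
  def spread(rows: Seq[(String, Int)]): Map[String, Int] =
    rows.groupBy(_._1).map { case (tpe, group) =>
      val times = group.map(_._2)
      tpe -> (times.max - times.min)
    }

  // Minimum absolute difference over every pair of rows of a type (brute force);
  // types with a single row have no pairs and are omitted
  def minPairDiff(rows: Seq[(String, Int)]): Map[String, Int] =
    rows.groupBy(_._1).collect { case (tpe, group) if group.size > 1 =>
      tpe -> group.map(_._2).combinations(2).map { case Seq(a, b) => math.abs(a - b) }.min
    }
}
```

With the question's four rows both methods agree (4 for A, 6 for B). Once a type has a third row, for example an extra `("A", -10)`, the spread for A becomes 15 while the minimum pairwise difference stays 4.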

Answer 2

One possible approach is to use window functions.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, min, abs}

val df = Seq(
  ("A", -10), ("A", 1), ("A", 5), ("B", 3), ("B", 9)
).toDF("type", "time")

First, let's determine the differences between consecutive rows sorted by time:

// Partition by type and sort by time
val w1 = Window.partitionBy($"Type").orderBy($"Time")

// Difference between this and previous
val diff = $"time" - lag($"time", 1).over(w1)

Then find the minimum over all differences for a given type:

// Partition by type, unordered, and take an unbounded window
val w2 = Window.partitionBy($"Type").rowsBetween(Long.MinValue, Long.MaxValue)

// Minimum difference over type
val minDiff = min(diff).over(w2)

df.withColumn("min_diff",  minDiff).show


// +----+----+--------+
// |type|time|min_diff|
// +----+----+--------+
// |   A| -10|       4|
// |   A|   1|       4|
// |   A|   5|       4|
// |   B|   3|       6|
// |   B|   9|       6|
// +----+----+--------+
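
Since the question also asks about Spark SQL, the same two steps can probably be written as a single query. The following is only a sketch, assuming Spark 2.0 or later (for `SparkSession` and `createOrReplaceTempView`) and Spark on the classpath; the view name `events` and the `MinDiffSql` object are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object MinDiffSql {
  // Inner query: gap to the previous row of the same type, ordered by time.
  // Outer query: smallest such gap per type, repeated on every row.
  val query: String =
    """SELECT `type`, `time`, MIN(diff) OVER (PARTITION BY `type`) AS min_diff
      |FROM (
      |  SELECT `type`, `time`,
      |         `time` - LAG(`time`, 1) OVER (PARTITION BY `type` ORDER BY `time`) AS diff
      |  FROM events
      |) t""".stripMargin

  // Registers the sample data as a temp view (hypothetical name) and runs the query
  def minDiffs(spark: SparkSession): Map[(String, Int), Int] = {
    import spark.implicits._
    Seq(("A", -10), ("A", 1), ("A", 5), ("B", 3), ("B", 9))
      .toDF("type", "time")
      .createOrReplaceTempView("events")
    spark.sql(query).collect()
      .map(r => (r.getString(0), r.getInt(1)) -> r.getInt(2))
      .toMap
  }
}
```

The first row of each type has no previous row, so its `diff` is NULL; `MIN` ignores NULLs, which is why the outer window still returns a value for every row.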

If your goal is to find the minimum distance between the current row and any other row in its group, you can use a similar approach:

import org.apache.spark.sql.functions.{lead, when}

// Diff to previous
val diff_lag = $"time" - lag($"time", 1).over(w1)

// Diff to next
val diff_lead = lead($"time", 1).over(w1) - $"time"

val diffToClosest = when(
  diff_lag < diff_lead || diff_lead.isNull, 
  diff_lag
).otherwise(diff_lead)

df.withColumn("diff_to_closest", diffToClosest).show

// +----+----+---------------+
// |type|time|diff_to_closest|
// +----+----+---------------+
// |   A| -10|             11|
// |   A|   1|              4|
// |   A|   5|              4|
// |   B|   3|              6|
// |   B|   9|              6|
// +----+----+---------------+
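
To sanity-check the lag/lead logic outside Spark, here is a small plain-Scala sketch of the same idea: sort each group, then take the smaller of the gaps to the previous and next neighbour. `ClosestNeighbour` and `diffToClosest` are illustrative names only, and the `Option` result for single-row types is my own choice, not something the window version does.

```scala
object ClosestNeighbour {
  // For each (type, time) row: distance to the nearest other row of the same type.
  // None when the type has only one row. Output is ordered by type, then time.
  def diffToClosest(rows: Seq[(String, Int)]): Seq[(String, Int, Option[Int])] =
    rows.groupBy(_._1).toSeq.sortBy(_._1).flatMap { case (tpe, group) =>
      val sorted = group.map(_._2).sorted.toVector
      sorted.indices.map { i =>
        // Gap to the previous value (the "lag" side), if there is one
        val toPrev = if (i > 0) Some(sorted(i) - sorted(i - 1)) else None
        // Gap to the next value (the "lead" side), if there is one
        val toNext = if (i < sorted.size - 1) Some(sorted(i + 1) - sorted(i)) else None
        (tpe, sorted(i), (toPrev.toList ++ toNext.toList).reduceOption(_ min _))
      }
    }
}
```

On the sample data this gives 11, 4, 4 for type A and 6, 6 for type B, matching the table above.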

Answer 3

You should try something like this:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)

import sqlContext.implicits._

val input = sc.parallelize(Seq(
  ("A", 1),
  ("B", 3),
  ("A", 5),
  ("B", 9)
))

val df = input.groupByKey().flatMap { case (key, values) =>
  val sorted = values.toList.sorted
  // Gaps between neighbours in sorted order; the smallest gap is the minimum absolute difference
  val gaps = sorted.zip(sorted.drop(1)).map { case (a, b) => b - a }
  // Only one record for some `Type`: no gaps, so fall back to the value itself
  val smallestDiff = if (gaps.isEmpty) sorted.head else gaps.min

  values.map { value =>
    (key, value, smallestDiff)
  }
}.toDF("Type", "Time", "SmallestDiff")

df.show()

Output:

+----+----+------------+
|Type|Time|SmallestDiff|
+----+----+------------+
|   A|   1|           4|
|   A|   5|           4|
|   B|   3|           6|
|   B|   9|           6|
+----+----+------------+
