Question

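I want to write code that groups an iterator of rows, Iterator[InputRow], so that each unique item (identified by unit and eventName) appears once, carrying its latest timestamp. In other words, eventTime in the resulting Iterator[T] should be the latest timestamp seen for that item. InputRow is defined as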

case class InputRow(unit:Int, eventName: String, eventTime:java.sql.Timestamp, value: Int)

Sample data before grouping:

+-----------------------+----+---------+-----+
|eventTime              |unit|eventName|value|
+-----------------------+----+---------+-----+
|2018-06-02 16:05:11    |2   |B        |1    |
|2018-06-02 16:05:12    |1   |A        |2    |
|2018-06-02 16:05:13    |2   |A        |2    |
|2018-06-02 16:05:14    |1   |A        |3    |
|2018-06-02 16:05:15    |2   |A        |3    |
+-----------------------+----+---------+-----+

After grouping:

+-----------------------+----+---------+-----+
|eventTime              |unit|eventName|value|
+-----------------------+----+---------+-----+
|2018-06-02 16:05:11    |2   |B        |1    |
|2018-06-02 16:05:14    |1   |A        |3    |
|2018-06-02 16:05:15    |2   |A        |3    |
+-----------------------+----+---------+-----+

What is a good way to write this in Scala?

Answer 1

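Good news: your question already contains the verbs that correspond to the function calls used in the code: group by and sort by (latest timestamp).

To sort InputRow values by timestamp, we need an implicit ordering: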

implicit val rowSortByTimestamp: Ordering[InputRow] = 
    (r1: InputRow, r2: InputRow) => r1.eventTime.compareTo(r2.eventTime)
// or shorter:
// implicit val rowSortByTimestamp: Ordering[InputRow] = 
//   _.eventTime compareTo _.eventTime

Now, given

val input: Iterator[InputRow] = ??? // input data

let's group the rows by (unit, eventName)

val result = input.toSeq.groupBy(row => (row.unit, row.eventName))

then extract the row with the latest timestamp from each group

  .map { case (gr, rows) => rows.sorted.last }

and sort the remaining rows from earliest to latest

  .toSeq.sorted

The result is

InputRow(2,B,2018-06-02 16:05:11.0,1)
InputRow(1,A,2018-06-02 16:05:14.0,3)
InputRow(2,A,2018-06-02 16:05:15.0,3)
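
Putting the steps above together, a minimal self-contained sketch could look like the following. The names LatestPerKeyExample, ts, sampleInput, and latestPerKey are illustrative and not part of the original answer, and the ordering is written as an anonymous class only so that it does not depend on function-literal conversion to Ordering.

```scala
import java.sql.Timestamp

object LatestPerKeyExample {
  case class InputRow(unit: Int, eventName: String, eventTime: Timestamp, value: Int)

  // Order rows by eventTime only (earliest first).
  implicit val rowSortByTimestamp: Ordering[InputRow] = new Ordering[InputRow] {
    def compare(r1: InputRow, r2: InputRow): Int =
      r1.eventTime.compareTo(r2.eventTime)
  }

  // Hypothetical helper: parse "yyyy-MM-dd HH:mm:ss" into a Timestamp.
  private def ts(s: String): Timestamp = Timestamp.valueOf(s)

  // The sample rows from the question.
  def sampleInput: Iterator[InputRow] = Iterator(
    InputRow(2, "B", ts("2018-06-02 16:05:11"), 1),
    InputRow(1, "A", ts("2018-06-02 16:05:12"), 2),
    InputRow(2, "A", ts("2018-06-02 16:05:13"), 2),
    InputRow(1, "A", ts("2018-06-02 16:05:14"), 3),
    InputRow(2, "A", ts("2018-06-02 16:05:15"), 3)
  )

  def latestPerKey(input: Iterator[InputRow]): Seq[InputRow] =
    input.toSeq
      .groupBy(row => (row.unit, row.eventName))  // one group per (unit, eventName)
      .map { case (_, rows) => rows.sorted.last } // latest row of each group
      .toSeq
      .sorted                                     // earliest to latest

  def main(args: Array[String]): Unit =
    latestPerKey(sampleInput).foreach(println)
}
```

Note that input.toSeq materializes the whole iterator in memory, so this sketch assumes the input is small enough for that.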

Answer 2

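You can use the built-in struct function to combine the eventTime and value columns into a single struct, so that the max by eventTime (that is, the latest) can be taken when grouping by unit and eventName and aggregating. This should give you the desired output: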

import org.apache.spark.sql.functions._
df.withColumn("struct", struct("eventTime", "value"))
    .groupBy("unit", "eventName")
    .agg(max("struct").as("struct"))
    .select(col("struct.eventTime"), col("unit"), col("eventName"), col("struct.value"))

which yields

+-------------------+----+---------+-----+
|eventTime          |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:14|1   |A        |3    |
|2018-06-02 16:05:11|2   |B        |1    |
|2018-06-02 16:05:15|2   |A        |3    |
+-------------------+----+---------+-----+
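
The reason max over the struct selects the latest row is that, as far as I know, Spark compares struct values field by field from left to right, so with eventTime as the first field the other fields only break ties. Plain Scala tuples are ordered the same way, which the following sketch illustrates. It is an analogy in plain Scala, not Spark code; StructMaxAnalogy, rowsForOneGroup, and the millisecond values are made up for illustration.

```scala
object StructMaxAnalogy {
  // Hypothetical stand-in for struct("eventTime", "value") within one group:
  // (eventTime as epoch milliseconds, value).
  val rowsForOneGroup: Seq[(Long, Int)] = Seq(
    (1000L, 2),
    (3000L, 3),
    (2000L, 9)
  )

  // Tuple ordering compares the first element first; the second element
  // is consulted only when the first elements are equal.
  val latest: (Long, Int) = rowsForOneGroup.max

  def main(args: Array[String]): Unit =
    println(latest) // prints (3000,3)
}
```

The value 9 in the third tuple does not win, because its timestamp is smaller than 3000.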

Answer 3

You can achieve this with foldLeft and map:

val grouped: Map[(Int, String), InputRow] = 
  rows
    .foldLeft(Map.empty[(Int, String), Seq[InputRow]])({ case (acc, row) =>
     val key = (row.unit, row.eventName)
     // Get from the accumulator the Seq that already exists or Nil if
     // this key has never been seen before
     val value = acc.getOrElse(key, Nil)
     // Update the accumulator
     acc + (key -> (value :+ row))
  })
  // Get the last element from the list of rows when grouped by unit and event.
  .map({case (k, v) => k -> v.last})

This assumes the rows already arrive sorted by eventTime. If that is not a safe assumption, you can define an implicit Ordering for java.sql.Timestamp and replace v.last with v.maxBy(_.eventTime).
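
As a variation on the fold above (not the answer's own code), the accumulator can hold just the latest row seen so far for each key. This avoids both the sortedness assumption and the intermediate Seq per key. The names SinglePassLatest and latestByKey are hypothetical.

```scala
import java.sql.Timestamp

object SinglePassLatest {
  case class InputRow(unit: Int, eventName: String, eventTime: Timestamp, value: Int)

  def latestByKey(rows: Iterator[InputRow]): Map[(Int, String), InputRow] =
    rows.foldLeft(Map.empty[(Int, String), InputRow]) { (acc, row) =>
      val key = (row.unit, row.eventName)
      acc.get(key) match {
        // The stored row is strictly later than the new one: keep it.
        case Some(existing) if existing.eventTime.after(row.eventTime) => acc
        // First row for this key, or the new row is at least as late: store it.
        case _ => acc + (key -> row)
      }
    }
}
```

When two rows for the same key share an eventTime, this sketch keeps whichever comes later in the iteration order.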


Edit

Or use .groupBy(row => (row.unit, row.eventName)) instead of foldLeft:

import java.sql.Timestamp

// Ordering used by maxBy to compare eventTime values
implicit val ordering: Ordering[Timestamp] = _ compareTo _
val grouped = rows.groupBy(row => (row.unit, row.eventName))
                  .values
                  .map(_.maxBy(_.eventTime))

Scala: How to group an Iterable[T] into an Iterable[T] by timestamp
