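Removing redundant rows from a Spark DataFrame using time series data

Asked: 2015-07-19 20:34:54

Tags: apache-spark dataframe apache-spark-sql

I have a Spark DataFrame similar to this one (timestamp and id column values are simplified for clarity):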

| Timestamp | id |     status  |
--------------------------------
|         1 |  1 |     pending |
|         2 |  2 |     pending |
|         3 |  1 | in-progress |
|         4 |  1 | in-progress |
|         5 |  3 | in-progress |
|         6 |  1 |     pending |
|         7 |  4 |      closed |
|         8 |  1 |     pending |
|         9 |  1 | in-progress |

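This is a time series of status events. What I ultimately want is only the rows that represent a status change. In that sense, the problem can be seen as one of removing redundant rows: for example, the entries at times 4 and 8, both for id = 1, should be removed because they do not represent a status change for that id.

For the rows above, this would give (order does not matter):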

| Timestamp | id |     status  |
--------------------------------
|         1 |  1 |     pending |
|         2 |  2 |     pending |
|         3 |  1 | in-progress |
|         5 |  3 | in-progress |
|         6 |  1 |     pending |
|         7 |  4 |      closed |
|         9 |  1 | in-progress |

My original plan was to partition by id and status, order by timestamp, and select the first row of each partition. However, this gives

| Timestamp | id |     status  |
--------------------------------
|         1 |  1 |     pending |
|         2 |  2 |     pending |
|         3 |  1 | in-progress |
|         5 |  3 | in-progress |
|         7 |  4 |      closed |

i.e. it loses the repeated status changes (the rows at times 6 and 9).
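To see why that plan drops rows, here is a minimal sketch that mimics it on plain Scala collections rather than on Spark. The `Event` case class and the in-memory list are stand-ins assumed for illustration; they are not part of the original DataFrame code.

```scala
// Hypothetical in-memory stand-in for the rows of the DataFrame in the question
case class Event(timestamp: Int, id: Int, status: String)

val events = List(
  Event(1, 1, "pending"), Event(2, 2, "pending"),
  Event(3, 1, "in-progress"), Event(4, 1, "in-progress"),
  Event(5, 3, "in-progress"), Event(6, 1, "pending"),
  Event(7, 4, "closed"), Event(8, 1, "pending"),
  Event(9, 1, "in-progress")
)

// "Partition by (id, status), order by timestamp, take the first row"
val firstPerIdAndStatus = events
  .groupBy(e => (e.id, e.status))
  .values
  .map(_.minBy(_.timestamp))
  .toList
  .sortBy(_.timestamp)

// Rows 6 and 9 share a partition with rows 1 and 3 respectively,
// so they can never be the first row of their partition.
println(firstPerIdAndStatus.map(_.timestamp)) // List(1, 2, 3, 5, 7)
```

Each (id, status) pair can contribute at most one row, so a status that an id returns to later is always discarded.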

Any pointers are appreciated. I am new to DataFrames and may be missing a trick or two.

1 Answer:

Answer 0: (score: 1)

Using the lag window function should do the trick.

// One status event per row
case class Event(timestamp: Int, id: Int, status: String)

// Sample data matching the table in the question
val events = sqlContext.createDataFrame(sc.parallelize(
    Event(1, 1, "pending") :: Event(2, 2, "pending") ::
    Event(3, 1, "in-progress") :: Event(4, 1, "in-progress") ::
    Event(5, 3, "in-progress") :: Event(6, 1, "pending") ::
    Event(7, 4, "closed") :: Event(8, 1, "pending") ::
    Event(9, 1, "in-progress") :: Nil
))

// Expose the DataFrame to SQL under the name "events".
// On Spark 1.x, window functions such as lag require sqlContext to be a HiveContext.
events.registerTempTable("events")

// Keep a row if it is the first one for its id or its status differs from the previous row
val query = """SELECT timestamp, id, status FROM (
    SELECT timestamp, id, status, lag(status) OVER (
        PARTITION BY id ORDER BY timestamp
    ) AS prev_status  FROM events) tmp
    WHERE prev_status IS NULL OR prev_status != status
    ORDER BY timestamp, id"""

sqlContext.sql(query).show

The inner query

SELECT timestamp, id, status, lag(status) OVER (
    PARTITION BY id ORDER BY timestamp
) AS prev_status  FROM events

produces the table below, where prev_status is the previous value of status for a given id, with rows ordered by timestamp.

+---------+--+-----------+-----------+
|timestamp|id|     status|prev_status|
+---------+--+-----------+-----------+
|        1| 1|    pending|       null|
|        3| 1|in-progress|    pending|
|        4| 1|in-progress|in-progress|
|        6| 1|    pending|in-progress|
|        8| 1|    pending|    pending|
|        9| 1|in-progress|    pending|
|        2| 2|    pending|       null|
|        5| 3|in-progress|       null|
|        7| 4|     closed|       null|
+---------+--+-----------+-----------+
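As a rough illustration of what lag computes here, the same pairing can be sketched on plain Scala collections, without Spark. The in-memory list and the name `withPrev` are assumptions made for this example only; `None` plays the role of SQL NULL.

```scala
// Hypothetical in-memory stand-in for the "events" table
case class Event(timestamp: Int, id: Int, status: String)

val events = List(
  Event(1, 1, "pending"), Event(2, 2, "pending"),
  Event(3, 1, "in-progress"), Event(4, 1, "in-progress"),
  Event(5, 3, "in-progress"), Event(6, 1, "pending"),
  Event(7, 4, "closed"), Event(8, 1, "pending"),
  Event(9, 1, "in-progress")
)

// PARTITION BY id ORDER BY timestamp, then pair each row with the
// status of the row before it (None for the first row of each id).
val withPrev: List[(Event, Option[String])] = events
  .groupBy(_.id)
  .toList
  .sortBy(_._1)
  .flatMap { case (_, rows) =>
    val sorted = rows.sortBy(_.timestamp)
    // Shift the status column down by one; zip truncates the extra element
    sorted.zip(None :: sorted.map(e => Some(e.status)))
  }

withPrev.foreach { case (e, prev) =>
  println(s"${e.timestamp} ${e.id} ${e.status} ${prev.getOrElse("null")}")
}
```

The printed rows should line up with the table above: one None per id, and otherwise the status of the preceding row for that id.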

The outer query

SELECT timestamp, id, status FROM (...)
WHERE prev_status IS NULL OR prev_status != status
ORDER BY timestamp, id

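simply keeps the rows where prev_status is NULL (the first row for a given id) or where prev_status differs from status (the status changed between consecutive timestamps). The ORDER BY is added only to make visual inspection easier.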
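The same logic can probably be written with the DataFrame API instead of a SQL string. The following is only a sketch under a few assumptions: the function name `statusChanges` is hypothetical, the column names are those of the `events` table above, and the Spark version provides `Window` and `lag` (1.4 or later). It uses `Column.notEqual` rather than an operator because the inequality operator is spelled differently across Spark versions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// Hypothetical helper: keeps only the rows where the status of an id changes.
// Assumes the input has the columns "timestamp", "id" and "status".
def statusChanges(events: DataFrame): DataFrame = {
  // Equivalent of: OVER (PARTITION BY id ORDER BY timestamp)
  val byId = Window.partitionBy("id").orderBy("timestamp")

  events
    // Previous status for the same id, null on the first row of each id
    .withColumn("prev_status", lag(col("status"), 1).over(byId))
    // WHERE prev_status IS NULL OR prev_status != status
    .where(col("prev_status").isNull.or(col("prev_status").notEqual(col("status"))))
    .select("timestamp", "id", "status")
}
```

With the `events` DataFrame defined earlier, `statusChanges(events).orderBy("timestamp", "id").show()` should display the same seven rows as the SQL query.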