我有一个类似于此的Spark数据框(为了清晰起见,简化了时间戳和id列值):
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 4 | 1 | in-progress |
| 5 | 3 | in-progress |
| 6 | 1 | pending |
| 7 | 4 | closed |
| 8 | 1 | pending |
| 9 | 1 | in-progress |
这是状态事件的时间序列。我最终想要的只是代表状态变化的行。从这个意义上讲,问题可以看作是删除冗余行的问题 - 例如时间4和8的条目 - 都是id = 1 - 应该被删除,因为它们不代表给定id的状态变化。
对于上面的行集,这将给出(顺序不重要):
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 5 | 3 | in-progress |
| 6 | 1 | pending |
| 7 | 4 | closed |
| 9 | 1 | in-progress |
原始计划是按ID和状态进行分区,按时间戳排序,然后选择每个分区的第一行 - 但是这会给出
| Timestamp | id | status |
--------------------------------
| 1 | 1 | pending |
| 2 | 2 | pending |
| 3 | 1 | in-progress |
| 5 | 3 | in-progress |
| 7 | 4 | closed |
即。它失去了重复的状态变化。
任何指针都赞赏,我是数据框架的新手,可能会错过一两个技巧。
答案 0 :(得分:1)
使用lag
窗口函数应该可以解决问题
case class Event(timestamp: Int, id: Int, status: String)
val events = sqlContext.createDataFrame(sc.parallelize(
Event(1, 1, "pending") :: Event(2, 2, "pending") ::
Event(3, 1, "in-progress") :: Event(4, 1, "in-progress") ::
Event(5, 3, "in-progress") :: Event(6, 1, "pending") ::
Event(7, 4, "closed") :: Event(8, 1, "pending") ::
Event(9, 1, "in-progress") :: Nil
))
events.registerTempTable("events")
val query = """SELECT timestamp, id, status FROM (
SELECT timestamp, id, status, lag(status) OVER (
PARTITION BY id ORDER BY timestamp
) AS prev_status FROM events) tmp
WHERE prev_status IS NULL OR prev_status != status
ORDER BY timestamp, id"""
sqlContext.sql(query).show
内部查询
SELECT timestamp, id, status, lag(status) OVER (
PARTITION BY id ORDER BY timestamp
) AS prev_status FROM events
按以下方式创建表格,其中prev_status
是给定status
的先前值id
并按timestamp
排序。
+---------+--+-----------+-----------+
|timestamp|id| status|prev_status|
+---------+--+-----------+-----------+
| 1| 1| pending| null|
| 3| 1|in-progress| pending|
| 4| 1|in-progress|in-progress|
| 6| 1| pending|in-progress|
| 8| 1| pending| pending|
| 9| 1|in-progress| pending|
| 2| 2| pending| null|
| 5| 3|in-progress| null|
| 7| 4| closed| null|
+---------+--+-----------+-----------+
外部查询
SELECT timestamp, id, status FROM (...)
WHERE prev_status IS NULL OR prev_status != status
ORDER BY timestamp, id
只需过滤prev_status
为NULL
的行(给定id
的第一行)或prev_status
与status
不同(状态发生变化)在连续的时间戳之间)。添加订单只是为了使视觉检查更容易。