Spark SQL window function / lag() gives unexpected results

Time: 2015-11-02 16:19:30

Tags: apache-spark apache-spark-sql pyspark pyspark-sql

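Edit: After some more troubleshooting, I found the following. I added lag(event_time) to the query to see which date the query was picking up, and got equally strange results: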

              device_id               |     event_time      |    unix_time    |               event_id               |          lag_time
--------------------------------------+---------------------+-----------------+--------------------------------------+----------------------------
 111111111111111111111111111111111111 | 2015-03-01 10:41:47 | 1425206507.0000 | 4C48BE67-31EB-4432-96EF-0F30B4191340 |
 111111111111111111111111111111111111 | 2015-03-01 10:41:48 | 1425206508.0000 | A4AE33A2-6CDC-480C-A8D7-C4024810D236 | 2015-03-01 10:41:47
 111111111111111111111111111111111111 | 2015-03-01 10:41:51 | 1425206511.0000 | 997614AE-CE7E-45F6-BCE3-93E70CD46609 | 2015-03-01 10:41:48
 111111111111111111111111111111111111 | 2015-03-01 10:41:53 | 1425206513.0000 | 202DA38C-1823-4100-85AB-3DE139FB3CE3 | 2015-03-01 10:41:51
 111111111111111111111111111111111111 | 2015-03-01 10:42:11 | 1425206531.0000 | 8DFA7938-2123-4978-A89D-6C92404B504D | 2015-03-01 10:41:53
 111111111111111111111111111111111111 | 2015-03-01 10:42:12 | 1425206532.0000 | 463E9833-9526-4E76-A907-4651C13991A0 | 2015-03-01 10:42:11
 111111111111111111111111111111111111 | 2015-03-01 10:42:14 | 1425206534.0000 | 8204696E-3DAA-423E-9031-EC80BFA1157E | 2015-03-01 10:42:12
 111111111111111111111111111111111111 | 2015-03-01 10:42:20 | 1425206540.0000 | 10DB02E2-2D4F-4611-98D7-966074FBD398 | 2015-03-01 10:42:14
 111111111111111111111111111111111111 | 2015-03-01 10:42:20 | 1425206540.0000 | 50535667-A02D-47F7-8320-86EAC4638964 | 2015-03-01 10:42:20
 111111111111111111111111111111111111 | 2015-03-01 10:42:27 | 1425206547.0000 | 6C8BC79D-CB3E-4FE2-8EDD-B8421EA237FF | 2015-03-01 10:42:20
 111111111111111111111111111111111111 | 2015-03-01 10:42:27 | 1425206547.0000 | A732E59D-2EEE-44AE-BB5B-D016E6E6C597 | 2015-03-01 10:42:27
 111111111111111111111111111111111111 | 2015-03-01 10:42:27 | 1425206547.0000 | ABBE4184-65C6-41C1-AC0B-74828DDD8DE8 | 2015-03-01 10:42:27
 111111111111111111111111111111111111 | 2015-03-01 10:42:40 | 1425206560.0000 | 03D0B5FF-9E4D-4F14-8169-7D1C93C617B3 | 2015-03-01 10:42:27
 111111111111111111111111111111111111 | 2015-03-01 10:42:40 | 1425206560.0000 | 5C1DFE08-8081-4C84-9D8F-E29EBEB1C9D5 | 2015-03-01 10:42:40
 111111111111111111111111111111111111 | 2015-03-01 10:42:40 | 1425206560.0000 | 82C50F8E-9790-4C27-8979-9484954934B4 | 2015-03-01 10:42:40
 111111111111111111111111111111111111 | 2015-03-01 10:42:42 | 1425206562.0000 | 2D30722E-FB37-4563-B2CD-FA545D95AAB4 | 2015-03-01 10:42:40
 111111111111111111111111111111111111 | 2015-03-01 10:42:49 | 1425206569.0000 | 7613F856-763E-4792-904E-8D89E2502710 | 2015-03-01 10:42:42
 111111111111111111111111111111111111 | 2015-03-01 10:43:01 | 1425206581.0000 | DB39294A-E133-4A05-B367-210944965FA3 | 2015-03-01 10:42:49
 111111111111111111111111111111111111 | 2015-03-01 10:43:02 | 1425206582.0000 | 61B8AE48-D5C8-4809-9C4F-56EEA45E3626 | 2015-03-01 10:43:01
 111111111111111111111111111111111111 | 2015-03-01 10:43:02 | 1425206582.0000 | 82870AB0-08F1-403F-B805-836CC1454D1A | 2015-03-01 10:43:02
 111111111111111111111111111111111111 | 2015-03-01 10:43:04 | 1425206584.0000 | BA24E540-0F5B-4BC4-B59B-29D729FA5C35 | 2015-03-01 10:43:02
 999999999999999999999999999999999999 | 2015-05-13 16:40:19 | 1431535219.0000 | 25E20777-508D-4194-9324-BE8A44CE7B59 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:40:19 | 1431535219.0000 | 72DCEE64-CB3A-43CD-A949-FBE35C0A8873 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:40:21 | 1431535221.0000 | A0062926-C8A0-47ED-A7F0-65716893DDC0 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:40:25 | 1431535225.0000 | 0BCEABD0-DCEB-431C-A6FE-6E1243FC9890 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:40:34 | 1431535234.0000 | 8F72E8D1-A167-460B-9034-C4D1423CAF15 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:19 | 1431535279.0000 | 44C0214C-E13B-4CBB-92BB-4809EAD5DF15 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:20 | 1431535280.0000 | C8CBB801-E8F3-47FD-B9C9-50BD0DDFA356 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:22 | 1431535282.0000 | C3173090-8BB8-407F-BFF5-07B58BCFCA7B | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:48 | 1431535308.0000 | 91290A91-2394-4C5F-8687-6A74593F9FFB | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:48 | 1431535308.0000 | EF138C8C-93BC-4C29-8438-EB0214B3CFC4 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:52 | 1431535312.0000 | 76AA328F-D46D-4235-B780-2B05FCE02855 | 1970-01-03 09:15:58.430244
 999999999999999999999999999999999999 | 2015-05-13 16:41:53 | 1431535313.0000 | 563E04CE-7DDB-4543-8F1D-C81625E839F8 | 1970-01-03 09:15:58.430244

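For some device_ids (111... in the output above) lag_time comes back as expected, but for others (999...) every row gets the same bogus value, 1970-01-03 09:15:58.430244.

It looks like this happens when the first two events for a device_id have the same timestamp. Hopefully this gives someone a clue?

OP

I'm trying to use the lag() function in Spark SQL to determine the length of time between two consecutive events in a table. The important columns are device_id (a text column), unix_time (a numeric timestamp), and event_id, which is unique to each row.

The query I'm running:

SELECT device_id,
       unix_time,
       event_id,
       unix_time - lag(unix_time)
          OVER
        (PARTITION BY device_id ORDER BY unix_time,event_id)
         AS seconds_since_last_event
FROM ios_d_events;

In Postgres this gives the expected results. But when run in PySpark, any time there are two events with the same timestamp, seconds_since_last_event comes out as a nonsensical number such as -1435151676846888, -1431583545415023, or 25534. I can't figure out where these numbers are coming from.

I tried adding an if() statement to the query, like

if(unix_time = lag(unix_time)
     OVER (PARTITION BY device_id ORDER BY unix_time,event_id),
   0,
   unix_time - lag(unix_time)
     OVER (PARTITION BY device_id ORDER BY unix_time,event_id))
 AS seconds_since_last_event

but I get the same results. Any ideas what could be causing this?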
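
For anyone who wants a baseline to compare Spark's output against, here is a minimal sketch that runs the same lag() query in an in-memory SQLite database from Python. It is not part of the original post: the sample rows are three of the 999... events with shortened, hypothetical device and event ids, and it assumes the bundled SQLite is 3.25 or newer (the first release with window functions).

```python
import sqlite3

# Hypothetical sample: three of the 999... events, ids shortened for brevity.
# The first two rows share a timestamp, which is the case that misbehaved.
ROWS = [
    ("dev-9", 1431535219, "25E20777"),
    ("dev-9", 1431535219, "72DCEE64"),
    ("dev-9", 1431535221, "A0062926"),
]

# Same window expression as in the question; the outer ORDER BY only makes
# the row order of the result deterministic.
QUERY = """
SELECT device_id,
       unix_time,
       event_id,
       unix_time - lag(unix_time)
          OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
          AS seconds_since_last_event
FROM ios_d_events
ORDER BY device_id, unix_time, event_id
"""


def expected_deltas(rows):
    """Load rows into an in-memory table and return the computed deltas."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE ios_d_events "
            "(device_id TEXT, unix_time INTEGER, event_id TEXT)"
        )
        conn.executemany("INSERT INTO ios_d_events VALUES (?, ?, ?)", rows)
        return [row[3] for row in conn.execute(QUERY)]
    finally:
        conn.close()


if __name__ == "__main__":
    # The first row of a partition has no predecessor, so its delta is NULL.
    print(expected_deltas(ROWS))
```

With tied timestamps the delta should simply be 0 for the second row, so any other value for that row points at the engine rather than the query.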

1 answer:

Answer 0 (score: 1)

It turned out this was a bug in Spark 1.5.0. The problem does not occur in 1.5.1.
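
Since the answer ties the behaviour to one specific release, a small guard can flag when a job is running on it. This is only a sketch: the helper names are hypothetical, and it treats 1.5.0 alone as affected because that is the only version the answer mentions.

```python
def parse_version(version):
    """Turn a version string such as '1.5.0' into a tuple of ints, (1, 5, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as '-SNAPSHOT'
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad missing components with zero
    return tuple(parts)


def lag_bug_reported(version):
    """True only for 1.5.0, the release the answer reports as affected."""
    return parse_version(version) == (1, 5, 0)


if __name__ == "__main__":
    for v in ("1.5.0", "1.5.1"):
        print(v, lag_bug_reported(v))
```

In PySpark the version string can be read from the SparkContext, for example as sc.version, and passed to lag_bug_reported().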