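I have to run some data analysis on a table with 400+ million rows. I got it working on a small sample, but I'm sure it will run out of memory in production.
The table structure looks like this (for millions of serial numbers):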
+------------+---------------+------------+----------+
| date | serial_number | status_1 | status_2 |
+------------+---------------+------------+----------+
| 10/1/2018 | 123 | warehouse | v |
| 10/10/2018 | 123 | warehouse | w |
| 10/20/2018 | 123 | warehouse | x |
| 11/2/2018 | 123 | in transit | y |
+------------+---------------+------------+----------+
I need the date of the current row where status_1 = "in transit", together with the previous date where status_2 = "x". The result should look like this:
+-----------+---------------+------------+----------+------------+
| date_1 | serial_number | status_1 | status_2 | date_2 |
+-----------+---------------+------------+----------+------------+
| 11/2/2018 | 123 | in transit | x | 10/20/2018 |
+-----------+---------------+------------+----------+------------+
I got it working with two rank functions, but this will probably choke on a large table.
with transit as (
select
*
from (
select *,
rank() over(partition by serial_number order by date desc) rnk
from sample_t
order by serial_number, date asc
)
where rnk=1 and status_1 = 'in transit'
),
x_type as (
select
*
from (
select *,
rank() over(partition by serial_number order by date desc) rnk
from sample_t
order by serial_number, date asc
)
where rnk>1 and status_2 = 'x'
)
select tr.date date_1,
tr.serial_number,
tr.status_1,
x.status_2,
x.date date_2
from transit tr left join x_type x on tr.serial_number = x.serial_number
I can't see how to do this with a single rank function. Is there a better, more efficient way?
Answer 0 (score: 2)
You can do this with lag.
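-- Table, column, and status names follow the question (sample_t, serial_number, 'in transit').
select *
from (select t.*
-- lag() reads the row immediately before the current one within each serial number
,lag(status_2) over(partition by serial_number order by date) as prev_status_2
,lag(date) over(partition by serial_number order by date) as prev_date
from sample_t t
) t
where status_1 = 'in transit' and prev_status_2 = 'x'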
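To check the lag approach against the sample rows from the question, here is a minimal sketch that loads them into an in-memory SQLite database from Python. It rests on several assumptions that are not in the original post: the function name previous_x_before_transit and the ROWS list are hypothetical, dates are stored as ISO strings so that ordering by date is chronological, and the bundled SQLite is version 3.25 or later (required for window functions). The real database engine may differ in syntax and performance.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO strings
# (an assumption) so that ORDER BY date sorts them chronologically.
ROWS = [
    ("2018-10-01", 123, "warehouse", "v"),
    ("2018-10-10", 123, "warehouse", "w"),
    ("2018-10-20", 123, "warehouse", "x"),
    ("2018-11-02", 123, "in transit", "y"),
]

# Single pass over the table: lag() exposes the previous row's values
# within each serial_number partition.
QUERY = """
select date as date_1,
       serial_number,
       status_1,
       prev_status_2 as status_2,
       prev_date as date_2
from (
    select t.*,
           lag(status_2) over (partition by serial_number order by date) as prev_status_2,
           lag(date) over (partition by serial_number order by date) as prev_date
    from sample_t t
) t
where status_1 = 'in transit' and prev_status_2 = 'x'
"""


def previous_x_before_transit(rows):
    """Load rows into a throwaway in-memory table and run the lag query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table sample_t "
            "(date text, serial_number integer, status_1 text, status_2 text)"
        )
        conn.executemany("insert into sample_t values (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in previous_x_before_transit(ROWS):
        print(row)
```

Note that lag only looks at the row immediately before the current one. The two-rank query in the question matches any earlier row with status_2 = 'x', so the two approaches can return different results when other rows sit between the 'x' row and the 'in transit' row.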