How to select consecutive rows until a condition on another column is met?

Time: 2019-06-08 14:21:35

Tags: sql amazon-redshift

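I have a table (in Redshift) with the following four columns:

cust_id | timestamp | color | visited_page_sequence

For each cust_id, I want to select the rows between the row where visited_page_sequence LIKE '%first-page%' and the row where visited_page_sequence LIKE '%end-page%'. Note that some customers have a sequence containing only a row like '%first-page%' with nothing after it. Others have a row matching LIKE '%first-page%', followed by a consecutive row matching '%mid-page-1%' and another consecutive row matching '%mid-page-2%', but no row matching '%end-page%'.

How can I select this data, ordered by cust_id?

Here is an example of my table: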

| cust_id | timestamp           | color  | visited_page_sequence |
|---------|---------------------|--------|-----------------------|
| 54628   | 11/11/2015 11:46:00 | black  | this-first-page       |
| 54628   | 11/11/2015 11:47:00 | white  | this-middle-page1     |
| 94254   | 11/11/2015 11:48:00 |        |                       |
| 45456   | 11/11/2015 11:49:00 | braun  | this-first-page       |
| 45456   | 11/11/2015 11:50:00 | beige  | this-middle-page1     |
| 45456   | 11/11/2015 11:52:00 |        | this-end-page         |
| 55411   | 11/11/2015 11:53:00 | red    |                       |
| 42462   | 11/11/2015 11:54:00 | cyan   | this-another-page     |
| 24177   | 11/11/2015 11:55:00 | orange | this-first-page       |
| 24177   | 11/11/2015 11:56:00 | gray   | this-next-page        |
| 88888   | 11/11/2015 11:57:00 | pink   |                       |
| 94476   | 11/11/2015 11:58:00 | black  | this-first-page       |
| 94476   | 11/11/2015 11:59:00 | braun  | this-middle-page1     |
| 94476   | 11/11/2015 12:00:00 |        | this-middle-page2     |
| 94476   | 11/11/2015 12:01:00 | white  | this-end-page         |
| 64579   | 11/11/2015 12:02:00 | green  | this-another-page     |

I want something like this:

| cust_id | timestamp            | color     | visited_page_sequence |   
|---------|----------------------|-----------|-----------------------|
| 45456   | 11/11/2015 11:49:00  | braun     |this-first-page        |
| 45456   | 11/11/2015 11:50:00  | beige     |this-middle-page1      |
| 45456   | 11/11/2015 11:52:00  |           |this-end-page          |
| 94476   | 11/11/2015 11:58:00  | black     |this-first-page        |
| 94476   | 11/11/2015 11:59:00  | braun     |this-middle-page1      |
| 94476   | 11/11/2015 12:00:00  |           |this-middle-page2      |
| 94476   | 11/11/2015 12:01:00  | white     |this-end-page          |

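PS:
1) Each cust_id may have more than one row with visited_page_sequence like '%first-page%'.
2) Each cust_id may have more than one row with visited_page_sequence like '%middle-page-1%', middle-page-2, or any other middle page not listed here.
3) Each cust_id has no more than one row with visited_page_sequence like '%end-page%'.
4) The combination (cust_id, timestamp) has no duplicates.

Edit after comments:
5) If a value in visited_page_sequence occurs twice in a row, the last occurrence should be returned!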

2 Answers:

Answer 0: (score: 0)

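Assuming

  • there is no more than one row per cust_id with visited_page_sequence like '%first-page%'
  • there is no more than one row per cust_id with visited_page_sequence like '%end-page%'
  • the combination (cust_id, timestamp) has no duplicates

you can use: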

select t.*
from myTable f
join myTable l on l.cust_id = f.cust_id
join myTable t
  on  t.cust_id = f.cust_id
  and t.timestamp between f.timestamp and l.timestamp
where f.visited_page_sequence like '%first-page%'
  and l.visited_page_sequence like '%end-page%'
order by t.cust_id, t.timestamp

db-fiddle
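
To try the query above locally, here is a minimal sketch of a setup. The column types, the NULL used for the blank color, and the choice of rows are assumptions; only the table name myTable, the column names, and the values come from the posts. The last statement checks the first two assumptions listed above, since the join can return duplicated rows for any customer that violates them.

```sql
-- Hypothetical setup: column types are assumed, not stated in the question.
CREATE TABLE myTable (
    cust_id               INTEGER,
    "timestamp"           TIMESTAMP,    -- quoted as a precaution: it is also a type name
    color                 VARCHAR(32),
    visited_page_sequence VARCHAR(256)
);

-- A subset of the question's sample rows (blank color assumed to be NULL).
INSERT INTO myTable (cust_id, "timestamp", color, visited_page_sequence) VALUES
    (54628, '2015-11-11 11:46:00', 'black', 'this-first-page'),
    (54628, '2015-11-11 11:47:00', 'white', 'this-middle-page1'),
    (45456, '2015-11-11 11:49:00', 'braun', 'this-first-page'),
    (45456, '2015-11-11 11:50:00', 'beige', 'this-middle-page1'),
    (45456, '2015-11-11 11:52:00', NULL,    'this-end-page'),
    (42462, '2015-11-11 11:54:00', 'cyan',  'this-another-page');

-- Lists customers that break the "at most one first-page / end-page row" assumptions.
SELECT
    cust_id,
    SUM(CASE WHEN visited_page_sequence LIKE '%first-page%' THEN 1 ELSE 0 END) AS first_page_rows,
    SUM(CASE WHEN visited_page_sequence LIKE '%end-page%'   THEN 1 ELSE 0 END) AS end_page_rows
FROM myTable
GROUP BY cust_id
HAVING SUM(CASE WHEN visited_page_sequence LIKE '%first-page%' THEN 1 ELSE 0 END) > 1
    OR SUM(CASE WHEN visited_page_sequence LIKE '%end-page%'   THEN 1 ELSE 0 END) > 1;
```

With these rows the check returns nothing, and the join query keeps only customer 45456: customer 54628 has a first-page row but no end-page row, so the join to l finds no match, and customer 42462 has no first-page row at all.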

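Answer 1: (score: 0)

One approach is to first find the min/max timestamps of interest for each customer, then filter out the rows that fall outside that range.
Like this: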

/*
  for each customer, find out the min/max timestamp we are interested in,
  ie when they first visited a 'first-page' and last visited a 'end-page'
*/
WITH
min_max_by_customer AS (
  SELECT
    cust_id,
    MIN(
      CASE
        WHEN visited_page_sequence LIKE '%first-page%' THEN timestamp
        ELSE null
      END
    ) AS min_first_page_timestamp,
    MAX(
      CASE
        WHEN visited_page_sequence LIKE '%end-page%' THEN timestamp
        ELSE null
      END
    ) AS max_end_page_timestamp
  FROM i
  GROUP BY cust_id
),
/*
  fetch the actual data we're interested in (ie timestamp between first-page/end-page).
  also flag a row to be removed if the next row contains the same 'visited_page_sequence'
*/
rows_per_customer AS(
  SELECT
    i.*,
    visited_page_sequence = LEAD(visited_page_sequence) OVER (PARTITION BY cust_id ORDER BY timestamp ASC) AS same_page_as_next_row
  FROM i
  JOIN min_max_by_customer
  USING (cust_id)
  WHERE i.timestamp BETWEEN min_first_page_timestamp AND max_end_page_timestamp
)
SELECT *
FROM rows_per_customer
WHERE same_page_as_next_row IS NOT TRUE /* XXX not the same as 'IS FALSE' due to SQL's three-value logic */
;

Returns

┌─────────┬─────────────────────┬───────────────────────┐
│ cust_id │      timestamp      │ visited_page_sequence │
├─────────┼─────────────────────┼───────────────────────┤
│   45456 │ 2015-11-11 11:49:00 │  this-first-page      │
│   45456 │ 2015-11-11 11:50:00 │  this-middle-page1    │
│   45456 │ 2015-11-11 11:52:00 │ this-end-page         │
│   94476 │ 2015-11-11 11:58:00 │  this-first-page      │
│   94476 │ 2015-11-11 11:59:00 │  this-middle-page1    │
│   94476 │ 2015-11-11 12:00:00 │  this-middle-page2    │
│   94476 │ 2015-11-11 12:01:00 │  this-end-page        │
└─────────┴─────────────────────┴───────────────────────┘
(7 rows)
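
The comment about three-valued logic in the final WHERE clause matters because LEAD returns NULL for the last row of each partition, so the comparison on that row is NULL rather than FALSE. The sketch below is illustrative only (the labels and the inline derived table v are made up for the demonstration); it shows which of the two filters would keep a row for each possible flag value. The tests are wrapped in CASE because, as far as I know, Redshift accepts IS comparisons on booleans only as predicates, not directly in the SELECT list.

```sql
-- Illustrative only: compares the two filters for TRUE, FALSE and NULL flags.
SELECT
    v.label,
    CASE WHEN v.flag IS FALSE    THEN 'kept' ELSE 'dropped' END AS with_is_false,
    CASE WHEN v.flag IS NOT TRUE THEN 'kept' ELSE 'dropped' END AS with_is_not_true
FROM (
    SELECT 'true' AS label, TRUE AS flag
    UNION ALL SELECT 'false', FALSE
    UNION ALL SELECT 'null',  CAST(NULL AS BOOLEAN)
) AS v;
```

The 'null' row is kept by IS NOT TRUE but dropped by IS FALSE. In the query above that NULL case is each customer's final row, which is the end-page row, so filtering with IS FALSE would remove it from the result.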