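I have a table (in Redshift) with the following four columns:
cust_id | timestamp | color | visited_page_sequence
For each cust_id, I would like to select the rows between the row where visited_page_sequence LIKE '%first-page%' and the row where visited_page_sequence LIKE '%end-page%'.
Note that some customers have a sequence that only contains a row matching %first-page% and nothing afterwards. Others have a row matching LIKE %first-page%, followed by a consecutive row matching %mid-page-1%, followed by another consecutive row matching LIKE %mid-page-2%, but no row matching %end-page%.
How can I select this data, ordered by cust_id?
Here is an example of my table: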
| cust_id | timestamp           | color  | visited_page_sequence |
|---------|---------------------|--------|-----------------------|
| 54628   | 11/11/2015 11:46:00 | black  | this-first-page       |
| 54628   | 11/11/2015 11:47:00 | white  | this-middle-page1     |
| 94254   | 11/11/2015 11:48:00 |        |                       |
| 45456   | 11/11/2015 11:49:00 | braun  | this-first-page       |
| 45456   | 11/11/2015 11:50:00 | beige  | this-middle-page1     |
| 45456   | 11/11/2015 11:52:00 |        | this-end-page         |
| 55411   | 11/11/2015 11:53:00 | red    |                       |
| 42462   | 11/11/2015 11:54:00 | cyan   | this-another-page     |
| 24177   | 11/11/2015 11:55:00 | orange | this-first-page       |
| 24177   | 11/11/2015 11:56:00 | gray   | this-next-page        |
| 88888   | 11/11/2015 11:57:00 | pink   |                       |
| 94476   | 11/11/2015 11:58:00 | black  | this-first-page       |
| 94476   | 11/11/2015 11:59:00 | braun  | this-middle-page1     |
| 94476   | 11/11/2015 12:00:00 |        | this-middle-page2     |
| 94476   | 11/11/2015 12:01:00 | white  | this-end-page         |
| 64579   | 11/11/2015 12:02:00 | green  | this-another-page     |
I would like to get something like this:
| cust_id | timestamp           | color | visited_page_sequence |
|---------|---------------------|-------|-----------------------|
| 45456   | 11/11/2015 11:49:00 | braun | this-first-page       |
| 45456   | 11/11/2015 11:50:00 | beige | this-middle-page1     |
| 45456   | 11/11/2015 11:52:00 |       | this-end-page         |
| 94476   | 11/11/2015 11:58:00 | black | this-first-page       |
| 94476   | 11/11/2015 11:59:00 | braun | this-middle-page1     |
| 94476   | 11/11/2015 12:00:00 |       | this-middle-page2     |
| 94476   | 11/11/2015 12:01:00 | white | this-end-page         |
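PS:
1) There may be more than one row per cust_id with visited_page_sequence like '%first-page%'.
2) There may be more than one row per cust_id with visited_page_sequence like '%middle-page-1%', middle-page-2, or any other middle page not listed here.
3) There is no more than one row per cust_id with visited_page_sequence like '%end-page%'.
4) There are no duplicates for the combination (cust_id, timestamp).
Edit after comments:
5) If a value in visited_page_sequence occurs twice in consecutive rows, only the last occurrence should be returned!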
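Answer 0 (score: 0)
Assuming that:
- there is no more than one row per cust_id with visited_page_sequence like '%first-page%',
- there is no more than one row per cust_id with visited_page_sequence like '%end-page%',
- there are no duplicates for the combination (cust_id, timestamp),
you can use: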
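-- f: the row where the customer reaches the first page
-- l: the row where the same customer reaches the end page
-- t: every row of that customer between the two timestamps (inclusive)
select t.*
from myTable f
join myTable l
  on l.cust_id = f.cust_id
join myTable t
  on t.cust_id = f.cust_id
 and t.timestamp between f.timestamp and l.timestamp
where f.visited_page_sequence like '%first-page%'
  and l.visited_page_sequence like '%end-page%'
order by t.cust_id, t.timestamp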
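To try the self-join without creating a table, here is a minimal sketch that inlines a few rows from the question as a CTE. The inline `myTable` CTE is only a stand-in for the real table, the ISO date literals are an assumption about how the timestamps are stored, and the column is quoted as `"timestamp"` because the word doubles as a type name, which some engines may treat as reserved.

```sql
-- Hypothetical inline sample standing in for myTable; values come from the question.
WITH myTable AS (
    SELECT 54628 AS cust_id,
           CAST('2015-11-11 11:46:00' AS TIMESTAMP) AS "timestamp",
           CAST('black' AS VARCHAR(20)) AS color,
           CAST('this-first-page' AS VARCHAR(50)) AS visited_page_sequence
    UNION ALL SELECT 54628, CAST('2015-11-11 11:47:00' AS TIMESTAMP), 'white', 'this-middle-page1'
    UNION ALL SELECT 45456, CAST('2015-11-11 11:49:00' AS TIMESTAMP), 'braun', 'this-first-page'
    UNION ALL SELECT 45456, CAST('2015-11-11 11:50:00' AS TIMESTAMP), 'beige', 'this-middle-page1'
    UNION ALL SELECT 45456, CAST('2015-11-11 11:52:00' AS TIMESTAMP), NULL,    'this-end-page'
    UNION ALL SELECT 42462, CAST('2015-11-11 11:54:00' AS TIMESTAMP), 'cyan',  'this-another-page'
)
SELECT t.cust_id, t."timestamp", t.color, t.visited_page_sequence
FROM myTable f
JOIN myTable l
  ON l.cust_id = f.cust_id
JOIN myTable t
  ON t.cust_id = f.cust_id
 AND t."timestamp" BETWEEN f."timestamp" AND l."timestamp"
WHERE f.visited_page_sequence LIKE '%first-page%'
  AND l.visited_page_sequence LIKE '%end-page%'
ORDER BY t.cust_id, t."timestamp";
```

On this sample only the three rows of customer 45456 come back: 54628 has a first page but no end page, and 42462 has neither. If a customer had several '%first-page%' rows, which point 1 of the question allows, the join would match each of them against the end page and some rows would come back more than once.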
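Answer 1 (score: 0)
One approach is to first work out the min/max timestamps of interest for each customer, then filter out the rows that fall outside that range.
Something like this: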
/*
  for each customer, find out the min/max timestamp we are interested in,
  i.e. when they first visited a 'first-page' and last visited an 'end-page'
*/
WITH
min_max_by_customer AS (
    SELECT
        cust_id,
        MIN(
            CASE
                WHEN visited_page_sequence LIKE '%first-page%' THEN timestamp
                ELSE null
            END
        ) AS min_first_page_timestamp,
        MAX(
            CASE
                WHEN visited_page_sequence LIKE '%end-page%' THEN timestamp
                ELSE null
            END
        ) AS max_end_page_timestamp
    FROM i
    GROUP BY cust_id
),
/*
  fetch the actual data we're interested in (i.e. timestamp between first-page/end-page).
  a customer with no first-page or no end-page has a NULL bound, so BETWEEN drops all their rows.
  also flag a row to be removed if the next row contains the same 'visited_page_sequence'
*/
rows_per_customer AS (
    SELECT
        i.*,
        visited_page_sequence = LEAD(visited_page_sequence) OVER (PARTITION BY cust_id ORDER BY timestamp ASC) AS same_page_as_next_row
    FROM i
    JOIN min_max_by_customer
        USING (cust_id)
    WHERE i.timestamp BETWEEN min_first_page_timestamp AND max_end_page_timestamp
)
SELECT *
FROM rows_per_customer
WHERE same_page_as_next_row IS NOT TRUE /* XXX not the same as 'IS FALSE' due to SQL's three-value logic */
;
This returns:
┌─────────┬─────────────────────┬───────────────────────┐
│ cust_id │ timestamp │ visited_page_sequence │
├─────────┼─────────────────────┼───────────────────────┤
│ 45456 │ 2015-11-11 11:49:00 │ this-first-page │
│ 45456 │ 2015-11-11 11:50:00 │ this-middle-page1 │
│ 45456 │ 2015-11-11 11:52:00 │ this-end-page │
│ 94476 │ 2015-11-11 11:58:00 │ this-first-page │
│ 94476 │ 2015-11-11 11:59:00 │ this-middle-page1 │
│ 94476 │ 2015-11-11 12:00:00 │ this-middle-page2 │
│ 94476 │ 2015-11-11 12:01:00 │ this-end-page │
└─────────┴─────────────────────┴───────────────────────┘
(7 rows)
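The `IS NOT TRUE` filter in the query above matters mostly for the last row of each customer: `LEAD` has no following row there and returns NULL, so the comparison evaluates to NULL rather than FALSE. A minimal sketch of that behaviour, using a made-up three-row sample (`sample_rows`, `step`, and `page` are hypothetical names, not part of the original table):

```sql
-- Hypothetical sample: two consecutive identical pages, then an end page.
WITH sample_rows AS (
    SELECT 1 AS step, CAST('this-first-page' AS VARCHAR(50)) AS page
    UNION ALL SELECT 2, 'this-first-page'
    UNION ALL SELECT 3, 'this-end-page'
),
flagged AS (
    SELECT
        step,
        page,
        -- step 1: TRUE  (same page as step 2)
        -- step 2: FALSE (differs from step 3)
        -- step 3: NULL  (no next row, so LEAD returns NULL)
        page = LEAD(page) OVER (ORDER BY step) AS same_page_as_next_row
    FROM sample_rows
)
SELECT step, page
FROM flagged
WHERE same_page_as_next_row IS NOT TRUE  -- keeps steps 2 and 3
ORDER BY step;
```

With `IS FALSE` in place of `IS NOT TRUE`, only step 2 would survive, and the '%end-page%' row would be lost. The same sketch also illustrates point 5 of the question: of the two consecutive 'this-first-page' rows, only the last one (step 2) is kept.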