I'm working on URL extraction on AWS Redshift. The URL column looks like this:
url                       item    origin
http://B123//ajdsb        apple   US
http://BYHG//B123         banana  UK
http://B325//BF89//BY85   candy   CA
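The result I want is to extract every sequence that starts with B, and to expand a row into several rows when its URL contains more than one such sequence: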
extracted  item    origin
B123       apple   US
BYHG       banana  UK
B123       banana  UK
B325       candy   CA
BF89       candy   CA
BY85       candy   CA
My current code is:
select REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})') as extracted, item, origin
from data
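The regex part works fine, but I'm having trouble extracting multiple values and expanding them into new rows. I tried REGEXP_MATCHES(url, '(B[0-9A-Z]{3})', 'g'), but the regexp_matches function does not exist on Redshift.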
Answer 0 (score: 2)
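The solution I used is fairly ugly, but it gets the desired result. It uses REGEXP_COUNT to determine the number of matches in each row, then joins the resulting table of numbers to a query that uses REGEXP_SUBSTR to pull out the nth match.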
-- Get a table with the count of matches
-- This returns the distinct match counts present in the data (1, 2, 3 for the sample rows),
-- so an index n is only available if some row has exactly n matches
WITH n_table AS (
SELECT
DISTINCT REGEXP_COUNT(url, '(B[0-9A-Z]{3})') AS n
FROM data
)
-- Join the previous table to the data table and use n in the REGEXP_SUBSTR call to get the nth match
SELECT
REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})', 1, n) AS extracted,
item,
origin
FROM data,
n_table
-- Only keep non-null matches
WHERE n > 0
AND REGEXP_COUNT(url, '(B[0-9A-Z]{3})') >= n
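Because n_table holds only the match counts that actually occur in the data, a variant that generates every index from 1 up to a fixed bound may be more robust when the counts have gaps. The following is a minimal sketch, not part of the original answer. It assumes that recursive CTEs (WITH RECURSIVE) are available on the cluster. The CTE name numbers and the upper bound of 10 are hypothetical choices for illustration.

```sql
-- Sketch: build the indexes 1..10 with a recursive CTE
-- (the bound of 10 is an assumption; raise it if a URL can hold more matches)
WITH RECURSIVE numbers(n) AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1
    FROM numbers
    WHERE n < 10
)
-- Pair each row with every index up to its own match count,
-- then extract the nth match for that index
SELECT
    REGEXP_SUBSTR(d.url, 'B[0-9A-Z]{3}', 1, numbers.n) AS extracted,
    d.item,
    d.origin
FROM data d
JOIN numbers
    ON numbers.n <= REGEXP_COUNT(d.url, 'B[0-9A-Z]{3}')
ORDER BY d.item, numbers.n;
```

The join condition does the same filtering as the WHERE clause above, so rows with no match drop out and no null extractions are produced.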
Answer 1 (score: 1)
IronFarm's answer inspired me, though I wanted to find a solution that doesn't require a cross join. Here's what I came up with:
with
-- raw data
src as (
select
1 as id,
'abc def ghi' as stuff
union all
select
2 as id,
'qwe rty' as stuff
),
-- for each id, get a series of indexes for
-- each match in the string
match_idxs as (
select
id,
generate_series(1, regexp_count(stuff, '[a-z]{3}')) as idx
from
src
)
select
src.id,
match_idxs.idx,
regexp_substr(src.stuff, '[a-z]{3}', 1, match_idxs.idx) as stuff_match
from
src
join match_idxs using (id)
order by
id, idx
;
This produces:
id | idx | stuff_match
----+-----+-------------
1 | 1 | abc
1 | 2 | def
1 | 3 | ghi
2 | 1 | qwe
2 | 2 | rty
(5 rows)