Redshift regex: get multiple matches and expand rows

Question

I'm working on URL extraction on AWS Redshift. The URL column looks like this:

url                       item     origin
http://B123//ajdsb        apple    US
http://BYHG//B123         banana   UK
http://B325//BF89//BY85   candy    CA

The result I want is every sequence that starts with B, with the row expanded into one row per sequence when a URL contains more than one:

extracted    item     origin
B123         apple    US
BYHG         banana   UK
B123         banana   UK
B325         candy    CA
BF89         candy    CA
BY85         candy    CA

My current code is:

select REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})') as extracted, item, origin
from data

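The regex part works fine, but I'm having trouble extracting multiple values and expanding them into new rows. I tried REGEXP_MATCHES(url, '(B[0-9A-Z]{3})', 'g'), but the regexp_matches function does not exist on Redshift.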

Answer 1

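The solution I use is rather ugly, but it achieves the desired result. It uses REGEXP_COUNT to find how many matches each row contains, then joins the resulting table of numbers to the data and passes each number to REGEXP_SUBSTR as the occurrence to extract.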

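-- Build a table of match counts to use as occurrence numbers
-- Note: DISTINCT returns only the counts that actually occur in the data (1, 2, 3 for the sample rows).
-- If some count below the maximum never occurs, that occurrence is skipped; use a full 1..max numbers table in that case.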
WITH n_table AS (
    SELECT
        DISTINCT REGEXP_COUNT(url, '(B[0-9A-Z]{3})') AS n
    FROM data
)
-- Join the previous table to the data table and use n in the REGEXP_SUBSTR call to get the nth match
SELECT
    REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})', 1, n) AS extracted,
    item,
    origin
FROM data,
     n_table
-- Only keep non-null matches
WHERE n > 0
  AND REGEXP_COUNT(url, '(B[0-9A-Z]{3})') >= n
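
To make the join logic easier to follow, here is a minimal Python sketch that emulates the query above on the sample rows from the question. It is only an illustration of the technique, not Redshift code: the helper names regexp_count and regexp_substr are hypothetical stand-ins written with Python's re module, and the in-memory list stands in for the data table.

```python
import re

# Same pattern as the SQL; the capturing parentheses are not needed here.
PATTERN = re.compile(r"B[0-9A-Z]{3}")

# Sample rows from the question: (url, item, origin)
data = [
    ("http://B123//ajdsb", "apple", "US"),
    ("http://BYHG//B123", "banana", "UK"),
    ("http://B325//BF89//BY85", "candy", "CA"),
]


def regexp_count(text):
    """Hypothetical stand-in for REGEXP_COUNT: number of non-overlapping matches."""
    return len(PATTERN.findall(text))


def regexp_substr(text, occurrence):
    """Hypothetical stand-in for REGEXP_SUBSTR(text, pattern, 1, occurrence).

    occurrence is 1-based; callers must ensure that many matches exist.
    """
    return PATTERN.findall(text)[occurrence - 1]


# n_table: the distinct match counts that occur in the data (SELECT DISTINCT ...)
n_table = sorted({regexp_count(url) for url, _, _ in data})

# Cross join data with n_table, keeping only pairs where the nth match exists
expanded = [
    (regexp_substr(url, n), item, origin)
    for url, item, origin in data
    for n in n_table
    if n > 0 and regexp_count(url) >= n
]

for row in expanded:
    print(row)
```

With the sample data the distinct counts happen to be 1, 2 and 3, so every occurrence is covered. If the counts had a gap (say only 1 and 3 occurred), the second match of a three-match row would be dropped, which is why a complete 1..max numbers table is the safer choice.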

Answer 2

IronFarm's answer inspired me, though I wanted a solution that doesn't require a cross join. Here is what I came up with:

with 

-- raw data
src as (
  select 
    1 as id,
    'abc def ghi' as stuff
  union all 
  select
    2 as id,
    'qwe rty' as stuff
),

-- for each id, get a series of indexes for
-- each match in the string
match_idxs as (
  select
    id,
    generate_series(1, regexp_count(stuff, '[a-z]{3}')) as idx
  from
    src
)

select 
  src.id,
  match_idxs.idx,
  regexp_substr(src.stuff, '[a-z]{3}', 1, match_idxs.idx) as stuff_match
from 
  src 
  join match_idxs using (id)
order by 
  id, idx
;

This produces:

 id | idx | stuff_match
----+-----+-------------
  1 |   1 | abc
  1 |   2 | def
  1 |   3 | ghi
  2 |   1 | qwe
  2 |   2 | rty
(5 rows)
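
The per-row index idea can also be sketched outside SQL. The following Python snippet is a rough emulation under stated assumptions, not Redshift code: range(1, count + 1) plays the role of generate_series(1, regexp_count(...)), and the list src mirrors the src CTE above.

```python
import re

PATTERN = re.compile(r"[a-z]{3}")

# Mirrors the src CTE: (id, stuff)
src = [
    (1, "abc def ghi"),
    (2, "qwe rty"),
]

rows = []
for row_id, stuff in src:
    matches = PATTERN.findall(stuff)
    # One index per match in this row only, so no cross join
    # against a shared numbers table is needed.
    for idx in range(1, len(matches) + 1):
        # matches[idx - 1] corresponds to regexp_substr(stuff, pattern, 1, idx)
        rows.append((row_id, idx, matches[idx - 1]))

for row in rows:
    print(row)
```

Because each row generates exactly as many indexes as it has matches, no filtering step is required afterwards.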
