Vertica SQL function to split a string into separate columns

Asked: 2016-08-31 19:07:13

Tags: sql split vertica

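Is there a way in SQL to split a string into n columns based on a delimiter in the string? I know the SPLIT_PART function, which takes three arguments: the string, the delimiter, and the position of the field to return. For example: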

    select
      split_part('2016-01-01 00:11:00|Sprout|0', '|', 1),
      split_part('2016-01-01 00:11:00|Sprout|0', '|', 2),
      split_part('2016-01-01 00:11:00|Sprout|0', '|', 3);

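Is there a way to do this without the third argument, so that you supply only the string and the delimiter and get back as many columns as there are delimited fields in the string?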

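Once Vertica allows Python-based UDFs, I know this will be a simple fix using the .split() method, but is there a solution in the meantime? I know this may be a long shot, and I am asking mostly out of curiosity, since split_part works perfectly well for my purposes.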

"This isn't possible" would be an acceptable answer.

1 answer:

Answer 0 (score: 1)

Sure. If you are happy with getting the n-th token of the string, try:

    SQL>SELECT
    ...>  regexp_substr(
    ...>    '2016-01-01 00:11:00|Sprout|0' -- source string
    ...>  , '[|]?([^|]+)' -- pattern (an optional bar, followed by many non-bars, which we remember as the 1st group)
    ...>  , 1             -- starting from begin of string: position 1
    ...>  , 1             -- the N-th occurrence
    ...>  , ''            -- no regexp modifier
    ...>  , 1             -- we want the only remembered group - the 1st
    ...>  ) the_first
    ...>, regexp_substr(
    ...>    '2016-01-01 00:11:00|Sprout|0' -- source string
    ...>  , '[|]?([^|]+)' -- pattern (an optional bar, followed by many non-bars, which we remember as the 1st group)
    ...>  , 1             -- starting from begin of string: position 1
    ...>  , 2             -- the N-th occurrence
    ...>  , ''            -- no regexp modifier
    ...>  , 1             -- we want the only remembered group - the 1st
    ...>  ) the_second
    ...>, regexp_substr(
    ...>    '2016-01-01 00:11:00|Sprout|0' -- source string
    ...>  , '[|]?([^|]+)' -- pattern (an optional bar, followed by many non-bars, which we remember as the 1st group)
    ...>  , 1             -- starting from begin of string: position 1
    ...>  , 3             -- the N-th occurrence
    ...>  , ''            -- no regexp modifier
    ...>  , 1             -- we want the only remembered group - the 1st
    ...>  ) the_third
    ...>;
    the_first                   |the_second                  |the_third
    2016-01-01 00:11:00         |Sprout                      |0

But if you want to pivot your delimited string so that each token becomes a new row, there are two possibilities:

    SQL>-- manual, using regexp_substr ...
    ...>with
    ...>the_array as (
    ...>          select  1 as idx
    ...>union all select  2
    ...>union all select  3
    ...>union all select  4
    ...>union all select  5
    ...>union all select  6
    ...>union all select  7
    ...>union all select  8
    ...>union all select  9
    ...>union all select 10 -- increase if you might get a bigger array than one of 10 elements
    ...>)
    ...> ,concepts as (
    ...>select '2016-01-01 00:11:00|Sprout|0' as concepts_list
    ...>)
    ...>select * from (
    ...>  select
    ...>   idx
    ...>  ,trim(
    ...>    regexp_substr(
    ...>     concepts_list -- source string
    ...>    ,'[|]?([^|]+)' -- pattern (an optional bar, followed by many non-bars, which we remember as the 1st group)
    ...>    ,1             -- starting from begin of string: position 1
    ...>    ,idx           -- the idx-th occurrence
    ...>    ,''            -- no regexp modifier
    ...>    ,1             -- we want the only remembered group - the 1st
    ...>    )
    ...>   ) as concept
    ...>  from concepts
    ...>  cross join the_array
    ...>) foo
    ...>where concept <> ''
    ...>;
    idx                 |concept
                       1|2016-01-01 00:11:00
                       3|0
                       2|Sprout
    select succeeded; 3 rows fetched
    SQL>-- using the strings_package on:
    ...>-- https://github.com/vertica/Vertica-Extension-Packages/blob/master/strings_package/src/StringTokenizerDelim.cpp
    ...>WITH csvtab(id,delimstring) AS (
    ...>          SELECT 1,'2016-01-01 00:11:00|Sprout|0'
    ...>UNION ALL SELECT 2,'2016-01-02 00:11:00|Trout|1'
    ...>UNION ALL SELECT 3,'2016-01-03 00:11:00|Salmon|2'
    ...>UNION ALL SELECT 4,'2016-01-04 00:11:00|Bass|3'
    ...>)
    ...>SELECT id, words
    ...>FROM (
    ...>  SELECT id, v_txtindex.StringTokenizerDelim(delimstring,'|') OVER (PARTITION by id) FROM csvtab
    ...>) a
    ...>ORDER BY 1;
    id                  |words
                       1|2016-01-01 00:11:00
                       1|Sprout
                       1|0
                       2|2016-01-02 00:11:00
                       2|Trout
                       2|1
                       3|2016-01-03 00:11:00
                       3|Salmon
                       3|2
                       4|2016-01-04 00:11:00
                       4|Bass
                       4|3
    select succeeded; 12 rows fetched
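
For completeness, the manual variant can also be sketched with split_part instead of regexp_substr, using regexp_count to work out how many fields the string has. This is only a sketch under the same assumption as the manual query above: the index table stops at 10, so raise it if longer lists are possible. Unlike the `concept <> ''` filter, it keeps empty fields as rows, and the ORDER BY makes the row order deterministic.

```sql
-- Sketch: one row per field, using split_part and a fixed index table.
WITH the_array(idx) AS (
            SELECT  1
  UNION ALL SELECT  2
  UNION ALL SELECT  3
  UNION ALL SELECT  4
  UNION ALL SELECT  5
  UNION ALL SELECT  6
  UNION ALL SELECT  7
  UNION ALL SELECT  8
  UNION ALL SELECT  9
  UNION ALL SELECT 10 -- assumed upper bound on fields per string
)
, concepts AS (
  SELECT '2016-01-01 00:11:00|Sprout|0' AS concepts_list
)
SELECT
  idx
, split_part(concepts_list, '|', idx) AS concept
FROM concepts
CROSS JOIN the_array
WHERE idx <= regexp_count(concepts_list, '[|]') + 1 -- number of fields = number of bars + 1
ORDER BY idx;
```

Either way, the number of output columns in a plain SELECT has to be fixed when the query is written, which is why both approaches return the tokens as rows rather than as a variable number of columns.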