Remove duplicates from a comma-separated string (Amazon Redshift)

Time: 2016-10-07 04:51:15

Tags: sql amazon-redshift

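I am using Amazon Redshift.

I have a column that stores a comma-separated string, such as Private, Private, Private, Private, Private, Private, United Healthcare. I want to remove the duplicates with a query, so the result should be Private, United Healthcare. I found some seemingly obvious solutions on Stack Overflow, and I understand this can be done with a regular expression.

So I tried: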

SELECT  regexp_replace('Private, Private, Private, Private, Private, Private, United Healthcare', '([^,]+)(,\1)+', '\1') AS insurances; 

SELECT  regexp_replace('Private, Private, Private, Private, Private, Private, United Healthcare', '([^,]+)(,\1)+', '\g') AS insurances; 

I also tried a few other regular expressions, but none of them seem to work. Is there a solution?
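For illustration only, the two-step idea behind the first pattern can be checked in Python's `re` module. This is an assumption-laden sketch: Python's engine is not Redshift's, and Redshift may handle backreferences differently. It shows that the pattern as written does not allow for the space after each comma, and that even the adjusted pattern only collapses adjacent repeats.

```python
import re

s = 'Private, Private, Private, Private, Private, Private, United Healthcare'

# Pattern from the question: the repeat must follow the comma directly,
# so the space after each comma gets absorbed into the captured group.
original = re.sub(r'([^,]+)(,\1)+', r'\1', s)

# Adjusted pattern (an assumption, not from the question): expect ", " between repeats.
adjusted = re.sub(r'([^,]+)(, \1)+', r'\1', s)

print(original)  # Private, Private, United Healthcare
print(adjusted)  # Private, United Healthcare
```

Neither pattern removes duplicates that are not next to each other, such as Private, United Healthcare, Private.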

3 Answers:

Answer 0: (score: 2)

Try it this way:

SELECT  array_agg(DISTINCT insurances) 
FROM (SELECT  regexp_split_to_table('Private, Private, Private, Private, Private, Private, United Healthcare'
              , ',\s+') AS insurances) x;

Alternatively:

SELECT DISTINCT UNNEST(regexp_split_to_array('Private, Private, Private, Private, Private, Private, United Healthcare', ',\s+')) AS insurances;

Comment: check http://docs.aws.amazon.com/redshift/latest/dg/String_functions_header.html. Both queries will fail on Redshift, because none of the functions listed there converts text to text[].

Answer 1: (score: 2)

Another option is to try a Python UDF. A simple Python function can remove the duplicates from the string and return the corrected version.
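A minimal sketch of such a function in plain Python might look like the following. The name `dedupe_list_string` is hypothetical, and the choices to strip whitespace, skip empty items, pass None through, and sort the result are assumptions made here for predictable output, not part of the answer. The body would still need to be wrapped in a Redshift UDF definition before it could be called from SQL.

```python
def dedupe_list_string(s):
    """Return the comma-separated string s with duplicate items removed."""
    # Assumption: a NULL column value arrives as None and is passed through.
    if s is None:
        return None
    # Split on commas and trim whitespace so "a,b" and "a, b" behave the same.
    items = [part.strip() for part in s.split(',')]
    # Drop empty items, remove duplicates, and sort for a deterministic result.
    unique = sorted(set(item for item in items if item))
    return ', '.join(unique)


print(dedupe_list_string('Private, Private, Private, United Healthcare'))
```

Sorting discards the original order of the items, so this variant suits cases where only the set of values matters.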

Answer 2: (score: 2)

The following is a user-defined function (UDF) for Amazon Redshift:

CREATE FUNCTION f_uniquify (s text)
  RETURNS text
IMMUTABLE
AS $$
  # Split string by comma-space, remove duplicates, convert back to comma-separated
  return ', '.join(set(s.split(', ')))
$$ LANGUAGE plpythonu;

Test it with:

select f_uniquify('Private, Private, Private, Private, Private, Private, United Healthcare');

This returns:

United Healthcare, Private

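A Python set does not preserve insertion order, so the values can come back in a different order than they appeared in the input. If the order of the returned values matters, more specific code is needed.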
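One way to keep the first-seen order is sketched below as a plain Python function. The name `uniquify_keep_order` is hypothetical, and the sketch assumes the same comma-space separator as f_uniquify. It tracks seen items in a set and appends each new item to a list, which avoids relying on dictionary ordering and so should also work on older Python versions.

```python
def uniquify_keep_order(s):
    """Remove duplicates from a ', '-separated string, keeping first-seen order."""
    seen = set()      # items already emitted
    ordered = []      # unique items in the order they first appeared
    for item in s.split(', '):
        if item not in seen:
            seen.add(item)
            ordered.append(item)
    return ', '.join(ordered)


print(uniquify_keep_order('Private, Private, Private, United Healthcare, Private'))
# Private, United Healthcare
```

The loop body could replace the single return line inside the UDF above, with the input handled the same way.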