Question

I need to write a query in a Redshift database that removes duplicate values within a column.

select regexp_replace('GiftCard,GiftCard',  '([^,]*)(,\2)+($|,)', '\2\3')

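Expected result: GiftCard

Actual result: GiftCard,GiftCard

Basically, I want to search the values in the column and remove any that are repeated.

Can anyone help me with this?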

Answer 1

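I'm not sure this can be done with a regular-expression query alone, but as Jon mentioned, a UDF works well here.

Simply split the string on commas, build a set of the unique words, and return them joined in whatever format you need. The function would look something like this: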

CREATE FUNCTION f_unique_words (s text)
    RETURNS text
IMMUTABLE
AS $$
    return ','.join(set(s.split(',')))
$$ LANGUAGE plpythonu;

Example usage:

> select f_unique_words('GiftCard,GiftCard');
[GiftCard]
> select f_unique_words('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard');
[frog,GiftCard,Dog,Frog,Cat]
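Note that a Python set has no defined order, so the order of the words in the output above is arbitrary. If the original order matters, the body of the UDF could be swapped for something along the lines of the sketch below. The helper name `unique_words_ordered` is hypothetical and not part of the original answer, and the `None` check assumes a SQL NULL reaches the function as `None`. It is plain Python that avoids anything specific to Python 3, since plpythonu runs Python 2.7.

```python
def unique_words_ordered(s):
    """Drop repeated comma-separated values, keeping first-seen order."""
    if s is None:  # assumption: a SQL NULL arrives as None
        return None
    seen = set()    # values already emitted
    result = []     # unique values in their original order
    for word in s.split(','):
        if word not in seen:
            seen.add(word)
            result.append(word)
    return ','.join(result)


print(unique_words_ordered('GiftCard,GiftCard'))
# GiftCard
print(unique_words_ordered('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard'))
# GiftCard,Cat,Dog,Frog,frog
```

The comparison is still case-sensitive, so Frog and frog are treated as different values, just as in f_unique_words.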

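This relies on you having the appropriate access to the cluster. To create the function, also make sure your user has been granted USAGE on the language plpythonu.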

As a side note, if you want a case-insensitive version that does not simply convert all of your output to lowercase, this will do it:

CREATE FUNCTION f_unique_words_ignore_case (s text)
    RETURNS text
IMMUTABLE
AS $$
    wordset = set(s.split(','))
    return ','.join(item for item in wordset if item.istitle() or item.title() not in wordset)
$$ LANGUAGE plpythonu;

Example usage:

> select f_unique_words_ignore_case('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard');
[GiftCard,Dog,Frog,Cat]
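One caveat with the function above: it only collapses variants when a title-case spelling is present in the input, so an input such as 'FROG,frog' would keep both values. A possible alternative, sketched below under the assumption that keeping the first spelling seen is acceptable, compares on a lowercased key instead. The name `unique_words_ignore_case_ordered` is hypothetical and not from the original answer, and the `None` check again assumes a SQL NULL arrives as `None`.

```python
def unique_words_ignore_case_ordered(s):
    """Case-insensitive de-duplication that keeps the first spelling seen."""
    if s is None:  # assumption: a SQL NULL arrives as None
        return None
    seen = set()    # lowercased keys already emitted
    result = []     # first-seen spellings in their original order
    for word in s.split(','):
        key = word.lower()
        if key not in seen:
            seen.add(key)
            result.append(word)
    return ','.join(result)


print(unique_words_ignore_case_ordered('GiftCard,Cat,Dog,Cat,Cat,Frog,frog,GiftCard'))
# GiftCard,Cat,Dog,Frog
print(unique_words_ignore_case_ordered('FROG,frog'))
# FROG
```

Unlike the set-based version, this one also returns the values in a stable order.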

Redshift: regexp to remove duplicates in column data
