SQL: group by the distinct values in a multi-value string column

Time: 2019-03-25 01:49:15

Tags: sql group-by amazon-redshift

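I want to group by the distinct values found in a string column that holds multiple values.

The column contains a list of strings in a standard, comma-separated format. The only possible values are a, b, c, d.

For example, the column collection (type: string) contains: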

Row 1: ["a","b"]
Row 2: ["b","c"]
Row 3: ["b","c","a"]
Row 4: ["d"]

The expected output is a count for each unique value:

collection | count
a | 2
b | 3
c | 2
d | 1

2 answers:

Answer 0 (score: 1)

For everything below, I use this table:

create table tmp (
 id INT auto_increment,
 test VARCHAR(255),
 PRIMARY KEY (id)
);

insert into tmp (test) values 
    ("a,b"),
    ("b,c"),
    ("b,c,a"),
    ("d")
;

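If the only possible values are a, b, c, d, you can try one of the following approaches. Note that this only works if you do not have values such as test and test_new, because test would then also match every test_new row and the counts would be wrong.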

select collection, COUNT(*) as count from tmp JOIN (
    select CONCAT("%", tb.collection, "%") as like_collection, collection from (
        select "a" COLLATE utf8_general_ci as collection
        union select "b" COLLATE utf8_general_ci as collection
        union select "c" COLLATE utf8_general_ci as collection
        union select "d" COLLATE utf8_general_ci as collection
    ) tb
) tb1 
ON tmp.test LIKE tb1.like_collection
GROUP BY tb1.collection;

This gives you the result you want:

collection | count
    a      |   2
    b      |   3
    c      |   2
    d      |   1

Or you can try this:

SELECT 
   (SELECT COUNT(*) FROM tmp WHERE test LIKE '%a%') as a_count,
   (SELECT COUNT(*) FROM tmp WHERE test LIKE '%b%') as b_count,
   (SELECT COUNT(*) FROM tmp WHERE test LIKE '%c%') as c_count,
   (SELECT COUNT(*) FROM tmp WHERE test LIKE '%d%') as d_count
;

The result will look like this:

a_count | b_count | c_count | d_count
2       |    3    |   2     |   1
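
The substring caveat mentioned earlier can be avoided by matching whole list items instead of substrings. The following is only a sketch, assuming MySQL (which the auto_increment table above suggests) and values stored as 'a,b' with no spaces after the commas; it uses MySQL's FIND_IN_SET function rather than LIKE and is not part of the original answer.

```sql
-- Sketch: count whole comma-separated items, not substrings.
-- FIND_IN_SET(str, strlist) returns the 1-based position of str
-- in the comma-separated list, or 0 when it is absent.
-- In MySQL the comparison yields 0 or 1, so SUM counts the matching rows.
SELECT
    SUM(FIND_IN_SET('a', test) > 0) AS a_count,
    SUM(FIND_IN_SET('b', test) > 0) AS b_count,
    SUM(FIND_IN_SET('c', test) > 0) AS c_count,
    SUM(FIND_IN_SET('d', test) > 0) AS d_count
FROM tmp;
```

With this form, a value such as test would not be counted for a row that only contains test_new, and the table is scanned once instead of four times.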

Answer 1 (score: 1)

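What you need to do is first explode the collection column into separate rows (like a flatMap operation). In Redshift, the only way to generate new rows is a JOIN, so let's CROSS JOIN the input table with a static table of consecutive numbers and keep only the rows whose id is less than or equal to the number of elements in the collection. Then we use the split_part function to read the item at the correct index. Once we have the exploded table, a simple GROUP BY does the rest.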
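
The steps described here can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the answer's exact query: input_table is a hypothetical table name, its collection column is assumed to hold plain comma-separated values such as 'a,b', and the inline numbers CTE (with a column named n instead of id) stands in for the static numbers table.

```sql
-- Assumption: input_table(collection VARCHAR) holds values like 'a,b' or 'b,c,a'.
WITH numbers AS (
    -- Hypothetical stand-in for a static numbers table; it must contain
    -- at least as many rows as the longest list has elements.
    SELECT 1 AS n UNION ALL
    SELECT 2 UNION ALL
    SELECT 3 UNION ALL
    SELECT 4
),
exploded AS (
    -- One output row per list element: SPLIT_PART positions are 1-based.
    SELECT SPLIT_PART(t.collection, ',', numbers.n) AS item
    FROM input_table t
    CROSS JOIN numbers
    -- Number of elements = number of commas + 1.
    WHERE numbers.n <= REGEXP_COUNT(t.collection, ',') + 1
)
SELECT item, COUNT(*) AS cnt
FROM exploded
GROUP BY item
ORDER BY item;
```

If a list can be longer than four items, the numbers source has to grow with it; a permanent numbers table is the usual choice in that case.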

If your items are stored as JSON array strings ('["a", "b", "c"]'), you can use JSON_ARRAY_LENGTH and JSON_EXTRACT_ARRAY_ELEMENT_TEXT in place of REGEXP_COUNT and SPLIT_PART, respectively.
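
For the JSON array format shown in the question, a variant might look like the sketch below. As before, input_table and the inline numbers CTE are hypothetical names chosen for illustration. Note that JSON_EXTRACT_ARRAY_ELEMENT_TEXT uses a zero-based index, so the numbers start at 0 here.

```sql
-- Assumption: input_table(collection VARCHAR) holds JSON arrays like '["a","b"]'.
WITH numbers AS (
    -- Zero-based positions; extend this list if arrays can have more than four elements.
    SELECT 0 AS n UNION ALL
    SELECT 1 UNION ALL
    SELECT 2 UNION ALL
    SELECT 3
),
exploded AS (
    -- One output row per array element, returned as text without the JSON quotes.
    SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT(t.collection, numbers.n) AS item
    FROM input_table t
    CROSS JOIN numbers
    -- Keep only positions that exist in this row's array.
    WHERE numbers.n < JSON_ARRAY_LENGTH(t.collection)
)
SELECT item, COUNT(*) AS cnt
FROM exploded
GROUP BY item
ORDER BY item;
```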