I have a dataset in a table.
id | attribute
-----------------
1 | a
2 | b
2 | a
2 | a
3 | c
Desired output:
attribute| num
-------------------
a | 1
b,a | 1
c | 1
In MySQL, I would use:
select attribute, count(*) num
from
(select id, group_concat(distinct attribute) attribute from dataset group by id) as subquery
group by attribute;
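I'm not sure this can be done in Redshift, because it does not support group_concat or any of the PostgreSQL group aggregate functions such as array_agg() or string_agg(). See this question.
An alternative that would work for me is a way to pick one random attribute from each group instead of using group_concat. How can this be done in Redshift?
Answer 0 (score: 2)
I found a way to pick a random attribute for each id, but it is rather tricky. I don't actually think it is a good approach, but it does work.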
SQL:
-- (1) uniq dataset
WITH uniq_dataset as (select id, attr from dataset group by id, attr)
SELECT
uds.id, rds.attr
FROM
-- (2) generate random rank for each id
(select id, round((random() * ((select count(*) from uniq_dataset iuds where iuds.id = ouds.id) - 1))::numeric, 0) + 1 as random_rk from (select distinct id from uniq_dataset) ouds) uds,
-- (3) rank table
(select rank() over(partition by id order by attr) as rk, id ,attr from uniq_dataset) rds
WHERE
uds.id = rds.id
AND
uds.random_rk = rds.rk
ORDER BY
uds.id;
Result:
id | attr
----+------
1 | a
2 | a
3 | c
OR
id | attr
----+------
1 | a
2 | b
3 | c
Here are the tables used in this SQL.
-- dataset (original table)
id | attr
----+------
1 | a
2 | b
2 | a
2 | a
3 | c
-- (1) uniq dataset
id | attr
----+------
1 | a
2 | a
2 | b
3 | c
-- (2) generate random rank for each id
id | random_rk
----+----
1 | 1
2 | 1 <- 1 or 2
3 | 1
-- (3) rank table
rk | id | attr
----+----+------
1 | 1 | a
1 | 2 | a
2 | 2 | b
1 | 3 | c
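One caveat about step (2): ROUND(random() * (n - 1)) + 1 is only uniform when a group has at most two distinct values. With three values, the middle rank is drawn about half of the time. The following is a minimal sketch of the same three-step idea using FLOOR instead. It assumes the dataset table with the id and attr columns shown above; the CTE names ranked and picks are made up for illustration, and it has not been run on a real cluster.

```sql
-- (1) distinct (id, attr) pairs, as in the answer above
WITH uniq_dataset AS (
    SELECT id, attr FROM dataset GROUP BY id, attr
),
-- (3) rank the distinct values inside each id
ranked AS (
    SELECT id, attr,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY attr) AS rk
    FROM uniq_dataset
),
-- (2) random() is in [0, 1), so FLOOR(random() * n) + 1 falls in 1..n
picks AS (
    SELECT id, FLOOR(random() * COUNT(*))::int + 1 AS random_rk
    FROM uniq_dataset
    GROUP BY id
)
SELECT r.id, r.attr
FROM ranked r
JOIN picks p
  ON p.id = r.id
 AND p.random_rk = r.rk
ORDER BY r.id;
```

Because the random rank is computed once per id in picks, each id still produces exactly one row.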
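Answer 1 (score: 0)
This solution, inspired by Masashi's, is simpler and picks a random element from each group in Redshift.
-- FIRST_VALUE over the whole partition returns the same value for every row of an id,
-- so GROUP BY collapses each id to a single row.
SELECT id, attribute
FROM (SELECT id,
             FIRST_VALUE(attribute)
               OVER (PARTITION BY id ORDER BY random()
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS attribute
      FROM dataset) AS picked
GROUP BY id, attribute
ORDER BY id;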
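To get from a random pick per id to the attribute | num output the question asks for, the pick can be fed into an outer count. The following is a sketch only, assuming the dataset table and attribute column from the question; the alias shuffled and the column rn are illustrative names. It uses ROW_NUMBER() rather than FIRST_VALUE, which is a different but equivalent way to keep one random row per id.

```sql
SELECT attribute, COUNT(*) AS num
FROM (
    -- number the rows of each id in random order
    SELECT id, attribute,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY random()) AS rn
    FROM dataset
) AS shuffled
WHERE rn = 1            -- keep one randomly chosen row per id
GROUP BY attribute
ORDER BY attribute;
```

Note that duplicate rows are not removed first, so for id 2 in the sample data the value a is twice as likely to be picked as b. Selecting from a de-duplicated subquery instead would weight each distinct value equally.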
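Answer 2 (score: 0)
This is an answer to the related question here. That question has been closed, so I am posting the answer here.
Here is a way to aggregate a column into a string: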
select * from temp;
attribute
-----------
a
c
b
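1) Give each row a unique rank (rank() is unique only while the attribute values are distinct; row_number() avoids ties)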
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select * from sub_table;
attribute | rnk
-----------+-----
a | 1
b | 2
c | 3
2) Use the concatenation operator || to merge the rows into one
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
(select attribute from sub_table where rnk = 2)||
(select attribute from sub_table where rnk = 3) res_string;
res_string
------------
abc
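This only works for a limited, known number of rows (X) in that column. They can be the first X rows according to whatever the "order by" clause sorts on. I guess this is expensive.
A CASE expression can be used to handle the NULL that results when a given rank does not exist.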
with sub_table as(select attribute, rank() over (order by attribute) rnk from temp)
select (select attribute from sub_table where rnk = 1)||
(select attribute from sub_table where rnk = 2)||
(select attribute from sub_table where rnk = 3)||
(case when (select attribute from sub_table where rnk = 4) is NULL then ''
else (select attribute from sub_table where rnk = 4) end) as res_string;
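The CASE expression above guards only rank 4. Since concatenating NULL with || yields NULL, any other missing rank would still turn the whole result into NULL. A shorter way to guard every rank is COALESCE, sketched here against the same temp table; this is a variation on the answer, not part of it.

```sql
WITH sub_table AS (
    SELECT attribute, RANK() OVER (ORDER BY attribute) AS rnk
    FROM temp
)
-- COALESCE replaces a missing rank (NULL scalar subquery) with an empty string
SELECT COALESCE((SELECT attribute FROM sub_table WHERE rnk = 1), '') ||
       COALESCE((SELECT attribute FROM sub_table WHERE rnk = 2), '') ||
       COALESCE((SELECT attribute FROM sub_table WHERE rnk = 3), '') ||
       COALESCE((SELECT attribute FROM sub_table WHERE rnk = 4), '') AS res_string;
```

The number of ranks is still fixed in the query text, so the limitation on row count described above applies here as well.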
Answer 3 (score: -2)
I have not tested this query, but Redshift supports these functions:
select id, array_to_string(array(select attribute from mydataset m where m.id=d.id),',')
from mydataset d;
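Redshift versions released after this thread was written provide a LISTAGG aggregate, which covers the group_concat use case directly. The sketch below assumes such a version is available and reuses the dataset table and attribute column from the question; the alias per_id is illustrative.

```sql
SELECT attribute, COUNT(*) AS num
FROM (
    -- one comma-separated string of distinct attributes per id
    SELECT id,
           LISTAGG(DISTINCT attribute, ',')
               WITHIN GROUP (ORDER BY attribute) AS attribute
    FROM dataset
    GROUP BY id
) AS per_id
GROUP BY attribute
ORDER BY attribute;
```

The WITHIN GROUP (ORDER BY attribute) clause makes the concatenated string deterministic, so id 2 in the sample data yields a,b rather than b,a, and ids with the same set of attributes always land in the same group.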