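I need the total number of shoes and hats from a table that contains user, filename, and payload columns. Duplicate records should be ignored, where a duplicate is a record with the same user, the same payload, and the same part of the filename after the '/'. In the sample table below, record #3 is a duplicate of record #2 under this rule. The desired result is the sum of shoes and hats, as shown in the expected output below.
Sample data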
+---+------+----------+-----------+
| # | User | Filename | Payload |
+---+------+----------+-----------+
| 1 | A | a/123 | Shoes = 3 |
| 2 | A | a/123 | Hats = 2 |
| 3 | A | b/123 | Hats = 2 |
| 4 | B | a/123 | Shoes = 1 |
| 5 | B | a/123 | Hats = 1 |
+---+------+----------+-----------+
Expected output
+-------+------+
| Shoes | Hats |
+-------+------+
| 4 | 3 |
+-------+------+
Answer 0 (score: 1)
Hive happens to support substring_index(), so you can do this:
select sum(case when payload like 'Shoes%'
                then cast(substring_index(payload, ' = ', -1) as int)
                else 0
           end) as num_shoes,
       sum(case when payload like 'Hats%'
                then cast(substring_index(payload, ' = ', -1) as int)
                else 0
           end) as num_hats
from (select t.*,
             -- keep one row per (user, payload, filename part after '/')
             row_number() over (partition by user, payload, substring_index(filename, '/', -1)
                                order by user
                               ) as seqnum
      from t
     ) t
where seqnum = 1;
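To check the deduplication rule against the sample data without a Hive cluster, here is a minimal Python sketch of the same logic. It is not Hive code; the function name `summarize` and the in-memory `rows` list are illustrative assumptions, and it assumes every payload has the form `Name = number`.

```python
# Sample rows from the question: (user, filename, payload).
rows = [
    ("A", "a/123", "Shoes = 3"),
    ("A", "a/123", "Hats = 2"),
    ("A", "b/123", "Hats = 2"),  # duplicate of the previous row under the rule
    ("B", "a/123", "Shoes = 1"),
    ("B", "a/123", "Hats = 1"),
]


def summarize(rows):
    """Sum quantities per item, counting each (user, payload, file suffix) once."""
    seen = set()
    totals = {}
    for user, filename, payload in rows:
        # Same idea as substring_index(filename, '/', -1): text after the last '/'.
        suffix = filename.rsplit("/", 1)[-1]
        key = (user, payload, suffix)
        if key in seen:
            continue  # plays the role of seqnum > 1 in the query
        seen.add(key)
        # Assumes the payload looks like "Shoes = 3".
        item, qty = payload.split(" = ")
        totals[item] = totals.get(item, 0) + int(qty)
    return totals


print(summarize(rows))
```

On the sample data this gives 4 shoes and 3 hats, which matches the expected output in the question.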
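I strongly recommend changing your data model so that the payload is not stored as a string. Numbers should be stored as numbers. Names should be stored as names. If you can avoid it, they should not be combined into a single string.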
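As a rough illustration of what a normalized model could look like, the sketch below uses Python's built-in sqlite3 module as a stand-in for Hive. The table name `payload_items` and its columns (`user_name`, `file_dir`, `file_name`, `item`, `quantity`) are hypothetical, not taken from the question. With the item name and quantity in their own columns, and the filename split at the '/', no string parsing is needed.

```python
import sqlite3

# In-memory SQLite database used only for illustration; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payload_items ("
    " user_name TEXT, file_dir TEXT, file_name TEXT,"
    " item TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO payload_items VALUES (?, ?, ?, ?, ?)",
    [
        ("A", "a", "123", "Shoes", 3),
        ("A", "a", "123", "Hats", 2),
        ("A", "b", "123", "Hats", 2),  # duplicate once file_dir is ignored
        ("B", "a", "123", "Shoes", 1),
        ("B", "a", "123", "Hats", 1),
    ],
)

# DISTINCT over everything except file_dir removes the duplicates,
# then a plain GROUP BY sums the numeric quantities.
totals = dict(
    conn.execute(
        "SELECT item, SUM(quantity) FROM ("
        "  SELECT DISTINCT user_name, file_name, item, quantity"
        "  FROM payload_items"
        ") GROUP BY item"
    ).fetchall()
)
print(totals)
```

The totals come out as 4 shoes and 3 hats, the same as the original query, but the deduplication and the sum are now ordinary column operations.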