Question:
Suppose there is a simple (but very large) table foods:
id name
-- -----------
01 ginger beer
02 white wine
03 red wine
04 ginger wine
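I want to count how many entries match certain hardcoded patterns, for example names containing the word 'ginger' (LIKE '%ginger%') or 'wine' (LIKE '%wine%'), or anything else, and write those counts into rows labeled with a comment. The result I am looking for is: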
comment total
--------------- -----
contains ginger 2
for wine lovers 3
Solution 1 (correct format but inefficient):
One can use UNION ALL and build the following:
SELECT * FROM
(
(
SELECT
'contains ginger' AS comment,
sum((name LIKE '%ginger%')::INT) AS total
FROM foods
)
UNION ALL
(
SELECT
'for wine lovers' AS comment,
sum((name LIKE '%wine%')::INT) AS total
FROM foods
)
)
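Apparently this works much like running several separate queries and stitching the results together afterwards, which is very inefficient.
Solution 2 (efficient but wrong format):
The following is many times faster than the previous solution: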
SELECT
sum((name LIKE '%ginger%')::INT) AS contains_ginger,
sum((name LIKE '%wine%')::INT) AS for_wine_lovers
FROM foods
The result is:
contains_ginger for_wine_lovers
--------------- ---------------
2 3
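So it is definitely possible to get the same information much faster, but in the wrong format...
Discussion:
What is the best overall approach? What should I do to get the result I want both efficiently and in the preferred format? Or is that really not possible?
By the way, I am writing this for Redshift (which is based on PostgreSQL).
Thanks.
Answer 0 (score: 2)
Both of your queries use the LIKE operator. Alternatively, we can use Position to find the location of a hardcoded word within the name. If the name contains the hardcoded word, it returns a number greater than 0.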
SELECT
    unnest(array['ginger', 'wine']) AS comments,
    unnest(array[ginger, wine]) AS count
FROM (
    -- one pass over foods: sum the per-row flags
    SELECT sum(contains_ginger) AS ginger, sum(contains_wine) AS wine
    FROM (
        SELECT CASE WHEN position('ginger' IN name) > 0 THEN 1 END AS contains_ginger,
               CASE WHEN position('wine' IN name) > 0 THEN 1 END AS contains_wine
        FROM foods
    ) t
) t1
Answer 1 (score: 1)
Try this for size:
Declare @searchTerms table (term varchar(100), comment varchar(100))
insert into @searchTerms values
('ginger','contains ginger')
,('wine','for wine lovers')
-- Add any others here
select t.comment, isnull(count(f.id),0) [total]
from @searchTerms t
left join foods f on (f.name like '%'+t.term+'%')
group by t.comment
order by 1
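I am not sure what the PostgreSQL syntax for temporary tables is; this example is for MS SQL Server, but I am sure you get the idea.
Update: according to the SQLines online converter, the syntax is effectively the same.
Answer 2 (score: 1)
Option 1: reshape manually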
CREATE TEMPORARY TABLE wide AS (
SELECT
sum((name LIKE '%ginger%')::INT) AS contains_ginger,
sum((name LIKE '%wine%')::INT) AS for_wine_lovers
...
FROM foods
);
SELECT
'contains ginger', contains_ginger FROM wide
UNION ALL
SELECT
'for wine lovers', for_wine_lovers FROM wide
UNION ALL
...;
Option 2: create a category table and use a join
-- not sure if redshift supports values, hence I'm using the union all to build the table
WITH categories (category_label, food_part) AS (
SELECT 'contains ginger', 'ginger'
union all
SELECT 'for wine lovers', 'wine'
...
)
SELECT
categories.category_label, COUNT(foods.id) -- count a foods column so a category with no match reports 0, not 1
FROM categories
LEFT JOIN foods ON foods.name LIKE ('%' || categories.food_part || '%')
GROUP BY 1
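Since you consider your solution 2 fast enough, option 1 should work for you.
Option 2 should also be reasonably efficient, and it is easier to write and extend. As an added bonus, this query will tell you when no foods exist in a given category.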
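One detail of Option 2 is worth checking: with a LEFT JOIN, COUNT(*) also counts the NULL-extended row of a category that matched nothing, while counting a column from foods does not. The sketch below is only an illustration, assuming Python's built-in sqlite3 module as a stand-in for Redshift (so dialect details may differ) and a hypothetical third category, 'for bread fans', that matches no row.

```python
import sqlite3

# In-memory stand-in for the foods table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foods (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO foods VALUES
        (1, 'ginger beer'), (2, 'white wine'), (3, 'red wine'), (4, 'ginger wine');
""")

QUERY = """
    WITH categories (category_label, food_part) AS (
        SELECT 'contains ginger', 'ginger'
        UNION ALL SELECT 'for wine lovers', 'wine'
        UNION ALL SELECT 'for bread fans', 'bread'  -- hypothetical: matches nothing
    )
    SELECT categories.category_label,
           COUNT(*)        AS count_star,  -- counts the NULL-extended row too
           COUNT(foods.id) AS count_id     -- counts only real matches
    FROM categories
    LEFT JOIN foods ON foods.name LIKE ('%' || categories.food_part || '%')
    GROUP BY categories.category_label
    ORDER BY categories.category_label
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

For the empty category, count_star comes out as 1 and count_id as 0, so the second form is the one that actually reports "no foods in this category".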
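Option 3: reshape and redistribute your data to better match the grouping key.
If query execution time is very important, you can also preprocess the dataset. How much this helps depends on your data volume and data distribution. Do you only have a few fixed categories, or will they be searched dynamically from some kind of interface?
For example:
If the dataset were reshaped like this: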
content name
-------- ----
ginger 01
ginger 04
beer 01
white 02
wine 02
wine 04
wine 03
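then you can shard and distribute on content, and each instance can perform its part of the aggregation in parallel.
The equivalent query here might look like this: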
WITH content_count AS (
SELECT content, COUNT(*) total
FROM reshaped_food_table
GROUP BY 1
)
SELECT
CASE content
WHEN 'ginger' THEN 'contains ginger'
WHEN 'wine' THEN 'for wine lovers'
ELSE 'other'
END category
, total
FROM content_count
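As a rough sketch of the reshaping step itself, the snippet below builds a reshaped_food_table by splitting each name on whitespace and then runs a query shaped like the one above. It assumes Python's built-in sqlite3 module rather than Redshift, a hypothetical column name food_id for the second column, and an added WHERE filter so that only the two labeled words are returned. Note that this counts whole words, which is not the same as the substring match done by LIKE '%ginger%'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reshaped_food_table (content TEXT, food_id INTEGER)")

# Sample rows from the question.
foods = [(1, "ginger beer"), (2, "white wine"), (3, "red wine"), (4, "ginger wine")]

# One row per (word, food id); set() avoids counting a word twice for one food.
pairs = [(word, food_id) for food_id, name in foods for word in set(name.split())]
conn.executemany("INSERT INTO reshaped_food_table VALUES (?, ?)", pairs)

QUERY = """
    WITH content_count AS (
        SELECT content, COUNT(*) AS total
        FROM reshaped_food_table
        GROUP BY content
    )
    SELECT CASE content
               WHEN 'ginger' THEN 'contains ginger'
               WHEN 'wine'   THEN 'for wine lovers'
               ELSE 'other'
           END AS category,
           total
    FROM content_count
    WHERE content IN ('ginger', 'wine')  -- assumption: keep only the labeled words
    ORDER BY category
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Without the WHERE filter, every other word would come back as its own 'other' row, since the grouping happens on content before the CASE is applied.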
Answer 3 (score: 1)
I don't know Redshift, but in Postgres I would start with something like this:
WITH foods (id, name) AS (VALUES
(1, 'ginger beer'), (2, 'white wine'), (3, 'red wine'), (4, 'ginger wine'))
SELECT hardcoded.comment, count(*)
FROM (VALUES ('%ginger%', 'contains ginger'), ('%wine%', 'for wine lovers'))
AS hardcoded (pattern, comment)
JOIN foods ON foods.name LIKE hardcoded.pattern
GROUP BY hardcoded.comment;
┌─────────────────┬───────┐
│ comment │ count │
├─────────────────┼───────┤
│ contains ginger │ 2 │
│ for wine lovers │ 3 │
└─────────────────┴───────┘
(2 rows)
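If that works, I would go on to create appropriate indexes on foods.name. That might include indexes on name and reverse(name), or on (name gist_trgm_ops), though I don't expect Redshift to provide pg_trgm.
Answer 4 (score: 1)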
Redshift is rather limited in comparison to modern Postgres.
No unnest(), no array_agg(), no ARRAY constructor, no VALUES expression, no LATERAL joins, no tablefunc module. All the tools that would make this nice and simple. At least we have CTEs ...
This should work properly and be relatively easy to extend:
WITH ct AS (
   SELECT a.arr
        , count(name ~ arr[1] OR NULL) AS ct1
        , count(name ~ arr[2] OR NULL) AS ct2
        , count(name ~ arr[3] OR NULL) AS ct3
     -- , ... more
   FROM   foods
   CROSS  JOIN (SELECT '{ginger, wine, bread}'::text[]) AS a(arr)
   GROUP  BY a.arr  -- needed because a.arr is selected next to aggregates
   )
SELECT arr[1] AS comment, ct1 AS total FROM ct
UNION ALL SELECT arr[2], ct2 FROM ct
UNION ALL SELECT arr[3], ct3 FROM ct
-- ... more
I use the Posix operator ~ to replace LIKE, because it is shorter and there is no need to add the placeholder %. Performance is about the same for this simple form in Postgres; I am not sure about Redshift.
count(boolean_expression OR NULL) should be a bit faster than sum(boolean_expression::int).
Indexes will not be able to improve performance for this single sequential scan over the whole table.
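To see the "aggregate once in a CTE, then unpivot with UNION ALL" pattern run end to end, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in, which has neither arrays nor the ~ operator, so the patterns are written inline with LIKE and the labels are hardcoded; the count(expr OR NULL) idiom itself carries over unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foods (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO foods VALUES
        (1, 'ginger beer'), (2, 'white wine'), (3, 'red wine'), (4, 'ginger wine');
""")

QUERY = """
    WITH ct AS (
        -- single pass over foods; (FALSE OR NULL) is NULL, which count() skips
        SELECT count(name LIKE '%ginger%' OR NULL) AS ct1,
               count(name LIKE '%wine%'   OR NULL) AS ct2
        FROM foods
    )
    -- unpivot the one-row result into one row per label
    SELECT 'contains ginger' AS comment, ct1 AS total FROM ct
    UNION ALL
    SELECT 'for wine lovers', ct2 FROM ct
"""

rows = conn.execute(QUERY).fetchall()
for comment, total in rows:
    print(comment, total)
```

The expensive part, the scan of foods, happens once inside the CTE; the UNION ALL branches only read the single aggregated row.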
Answer 5 (score: 0)
A little searching suggests that you could use your second approach for efficiency, put the result into a CTE, and then unnest() it, as described in "unpivot and PostgreSQL".
Answer 6 (score: 0)
Try this:
SELECT 'contains ginger' AS comment
, Count(*) AS total
FROM foods
WHERE name LIKE '%ginger%'
UNION ALL
SELECT 'for wine lovers'
, count(*)
FROM foods
WHERE name LIKE '%wine%'
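Answer 7 (score: 0)
From your example it seems that your product names contain at most 2 words. Splitting on spaces and checking whether the individual chunks match is more efficient than using like; afterwards, reshape manually as described in the other responses.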
WITH counts as (
SELECT
sum(('ginger' in (split_part(name,' ',1),split_part(name,' ',2)))::INT) AS contains_ginger,
sum(('wine' in (split_part(name,' ',1),split_part(name,' ',2)))::INT) AS for_wine_lovers
FROM foods
)
-- manually reshape
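Before switching from LIKE to word splitting, it may help to check that both give the same counts on your data, because splitting matches whole words only and, as written above, looks at the first two words only. The following is a plain Python illustration of that difference, not Redshift code; the row 'gingerbread latte' is a hypothetical addition to the sample data.

```python
# Sample names from the question plus one hypothetical row.
foods = ["ginger beer", "white wine", "red wine", "ginger wine", "gingerbread latte"]
labels = {"ginger": "contains ginger", "wine": "for wine lovers"}


def count_substring(word):
    """Mirrors name LIKE '%word%': the word may appear anywhere in the name."""
    return sum(word in name for name in foods)


def count_whole_word(word):
    """Mirrors word IN (split_part(name,' ',1), split_part(name,' ',2))."""
    return sum(word in name.split(" ")[:2] for name in foods)


substring_totals = {labels[w]: count_substring(w) for w in labels}
whole_word_totals = {labels[w]: count_whole_word(w) for w in labels}

print(substring_totals)
print(whole_word_totals)
```

Here the substring count for ginger includes 'gingerbread latte' while the whole-word count does not, so the two approaches are only interchangeable if names really consist of the exact words being searched for.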
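Answer 8 (score: 0)
Have you considered using a cursor?
Here is an example I wrote for SQL Server.
You just need some table that holds all the values you want to search for in the foods table (in the example below I called it SearchWordTable, with a column named SearchWord).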
CREATE TABLE #TemporaryTable
(
KeyWord nvarchar(50),
ResultCount int
);
DECLARE @searchWord nvarchar(50)
DECLARE @count INT
DECLARE statistic_cursor CURSOR FOR
SELECT SearchWord
FROM SearchWordTable
OPEN statistic_cursor
FETCH NEXT FROM statistic_cursor INTO @searchWord
WHILE @@FETCH_STATUS = 0
BEGIN
SELECT @count = COUNT(1) FROM foods
WHERE name LIKE '%'+@searchWord+'%'
INSERT INTO #TemporaryTable (KeyWord, ResultCount) VALUES (@searchWord, @count)
FETCH NEXT FROM statistic_cursor INTO @searchWord
END
CLOSE statistic_cursor
DEALLOCATE statistic_cursor
SELECT * FROM #TemporaryTable
DROP TABLE #TemporaryTable
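Answer 9 (score: 0)
I think the best option is to split the ingredient list into parts and then count them.
"Pass0".."Pass3" and "numbers" are just a tally table that produces the list of numbers 1..256, to emulate unnest.
"comments" is a simple table that you should have somewhere, holding the ingredients and their comments.
Use your table "foods" instead of mine ;)
Let's take a look: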
with
Pass0 as (select '1' as C union all select '1'), --2 rows
Pass1 as (select '1' as C from Pass0 as A, Pass0 as B),--4 rows
Pass2 as (select '1' as C from Pass1 as A, Pass1 as B),--16 rows
Pass3 as (select '1' as C from Pass2 as A, Pass2 as B),--256 rows
numbers as (
select ROW_NUMBER() OVER(ORDER BY C) AS N FROM Pass3
),
comments as (
select 'ginger' ingredient, 'contains ginger' comment union all
select 'wine', 'for wine lovers' union all
select 'ale', 'a warm kind of beer' union all
select 'beer', 'your old friend'
),
foods as (
select 01 id, 'ginger beer' name union all
select 02 ,'white wine' union all
select 03 ,'red wine' union all
select 04 ,'ginger wine' union all
select 05 ,'ginger ale' union all
select 06 ,'pale ale' union all
select 07 ,'ginger beer'
),
ingredients as (
select ingredient, COUNT(*) n
from foods d
CROSS JOIN LATERAL(
select SPLIT_PART(d.name, ' ', n.n) ingredient
from numbers n
where SPLIT_PART(d.name, ' ', n.n)<>''
) ingredients
group by ingredient
)
select i.*, coalesce(c.comment, 'no comment..') comment
from ingredients i
left join comments c on c.ingredient = i.ingredient
ingredient n comment
ale 2 a warm kind of beer
beer 2 your old friend
ginger 4 contains ginger
pale 1 no comment..
red 1 no comment..
white 1 no comment..
wine 3 for wine lovers
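Answer 10 (score: 0)
Here you go.
The WHERE filter reduces the number of rows going into the GROUP BY aggregation. It is not necessary for smaller data, but it will help if the table has billions of rows. Add further patterns to the REGEXP filter and to the CASE statement.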
SELECT CASE WHEN name LIKE '%ginger%' THEN 'contains ginger'
WHEN name LIKE '%wine%' THEN 'for wine lovers'
ELSE NULL END "comment"
,COUNT(*) total
FROM grouping_test
WHERE REGEXP_INSTR(name,'ginger|wine') > 0
GROUP BY 1
;
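One property of this CASE-based grouping is worth verifying against the expected output: each row lands in exactly one group, namely the first WHEN branch that matches. The sketch below assumes Python's built-in sqlite3 module and the four sample rows from the question in a table named foods; since SQLite has no REGEXP_INSTR, the pre-filter is written with LIKE ... OR ... instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foods (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO foods VALUES
        (1, 'ginger beer'), (2, 'white wine'), (3, 'red wine'), (4, 'ginger wine');
""")

QUERY = """
    SELECT CASE WHEN name LIKE '%ginger%' THEN 'contains ginger'
                WHEN name LIKE '%wine%'   THEN 'for wine lovers'
           END AS comment,
           COUNT(*) AS total
    FROM foods
    WHERE name LIKE '%ginger%' OR name LIKE '%wine%'  -- stand-in for the REGEXP filter
    GROUP BY comment
    ORDER BY comment
"""

rows = conn.execute(QUERY).fetchall()
for comment, total in rows:
    print(comment, total)
```

On the sample data, 'ginger wine' is counted under 'contains ginger' only, so 'for wine lovers' comes out as 2 rather than the 3 asked for in the question. The approach fits when categories are mutually exclusive, and undercounts when a name can match several patterns.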
Answer 11 (score: 0)
Try this SQL:
SELECT count(1) as total,'contains ginger' result
FROM foods where name LIKE '%ginger%'
union all
SELECT count(1),'for wine lovers'
FROM foods where name LIKE '%wine%'