How to efficiently count different values into different rows in SQL?

Time: 2017-08-07 08:45:37

Tags: sql database postgresql performance amazon-redshift

Question:

Suppose there is a simple (but large) table foods

id   name 
--   -----------  
01   ginger beer
02   white wine
03   red wine
04   ginger wine

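I want to count how many entries match specific hardcoded patterns, say contain the word 'ginger' (LIKE '%ginger%') or 'wine' (LIKE '%wine%') or anything else, and write each count into a row together with a comment. The result I am looking for is the following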

comment           total 
---------------   -----  
contains ginger   2
for wine lovers   3

Solution 1 (right format but inefficient):

It is possible to use UNION ALL and construct the following

SELECT * FROM
(
  (
    SELECT
      'contains ginger' AS comment,
      sum((name LIKE '%ginger%')::INT) AS total
    FROM foods
  )
  UNION ALL
  (
    SELECT
      'for wine lovers' AS comment,
      sum((name LIKE '%wine%')::INT) AS total
    FROM foods
  )
) AS t

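Apparently it works much like simply executing multiple queries and stitching the results together afterwards. It is very inefficient.

Solution 2 (efficient but wrong format):

The following is many times faster than the previous solution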

SELECT
  sum((name LIKE '%ginger%')::INT) AS contains_ginger,
  sum((name LIKE '%wine%')::INT) AS for_wine_lovers
FROM foods

The result is

contains_ginger   for_wine_lovers 
---------------   ---------------  
2                 3

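So it is definitely possible to get the same information much faster, but in the wrong format...

Discussion:

What is the best overall approach? What should I do to get the result I want both efficiently and in the right format? Or is that really impossible?

By the way, I am writing this for Redshift (which is based on PostgreSQL).

Thanks.

12 answers: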

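Answer 0: (score: 2)

Both of your queries use the LIKE operator. Alternatively, we can use position() to find the location of the hardcoded word within the name. If the name contains the hardcoded word, position() returns a number greater than 0.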

SELECT 
       unnest(array['ginger', 'wine']) AS comments,
       unnest(array[ginger, wine]) AS count
FROM
     (SELECT sum(contains_ginger) ginger , sum(contains_wine) wine
        FROM
             (SELECT CASE WHEN Position('ginger' in name)>0 
                          THEN 1 
                           END contains_ginger,
                     CASE WHEN Position('wine' in name) > 0 
                          THEN 1
                           END contains_wine
                 FROM foods) t) t1

Answer 1: (score: 1)

Try this for size:

Declare @searchTerms table (term varchar(100), comment varchar(100))
insert into @searchTerms values
('ginger','contains ginger')
,('wine','for wine lovers')
-- Add any others here

select t.comment, isnull(count(f.id),0) [total]
from @searchTerms t
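left join foods f on (f.name like '%'+t.term+'%')
group by t.comment
order by 1

I'm not sure what the temp table syntax is for postgresql - this example is for MS SQL Server, but I'm sure you get the idea

UPDATE: According to the online converter at SQLines, the syntax is effectively the same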

Answer 2: (score: 1)

Option 1: Manually reshape

CREATE TEMPORARY TABLE wide AS (
  SELECT
    sum((name LIKE '%ginger%')::INT) AS contains_ginger,
    sum((name LIKE '%wine%')::INT) AS for_wine_lovers
    ...
  FROM foods
);

SELECT
  'contains ginger', contains_ginger FROM wide

UNION ALL
SELECT 
  'for wine lovers', for_wine_lovers FROM wide

UNION ALL
...;
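
A minimal runnable sketch of the same reshaping idea, using Python's built-in sqlite3 module as a stand-in for the real database (an assumption made purely so the example is self-contained). It swaps the temporary table for a CTE and the ::INT cast for a CASE expression, because SQLite has no such cast; whether a CTE referenced twice is computed only once depends on the engine, which is why a temporary table makes the single scan explicit.

```python
import sqlite3

# In-memory SQLite database standing in for the real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foods (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO foods (id, name) VALUES (?, ?)",
    [(1, "ginger beer"), (2, "white wine"), (3, "red wine"), (4, "ginger wine")],
)

# The wide CTE aggregates foods into a single row with one column per pattern.
# Each UNION ALL branch then reads one column of that row and emits it as a row.
QUERY = """
WITH wide AS (
  SELECT
    SUM(CASE WHEN name LIKE '%ginger%' THEN 1 ELSE 0 END) AS contains_ginger,
    SUM(CASE WHEN name LIKE '%wine%'   THEN 1 ELSE 0 END) AS for_wine_lovers
  FROM foods
)
SELECT 'contains ginger' AS comment, contains_ginger AS total FROM wide
UNION ALL
SELECT 'for wine lovers', for_wine_lovers FROM wide
"""

rows = conn.execute(QUERY).fetchall()
for comment, total in rows:
    print(f"{comment:<17} {total}")
```

Each extra category costs one more aggregate column and one more UNION ALL branch, but the branches only read the one-row intermediate result, not the big table.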

Option 2: Create a categories table & use a join

-- not sure if redshift supports values, hence I'm using the union all to build the table
WITH categories (category_label, food_part) AS (
    SELECT 'contains ginger', 'ginger'
    union all
    SELECT 'for wine lovers', 'wine'
    ...
)
SELECT
categories.category_label, COUNT(foods.id)
FROM categories
LEFT JOIN foods ON foods.name LIKE ('%' || categories.food_part || '%')
GROUP BY 1

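Since you consider your solution 2 fast enough, option 1 should work for you.

Option 2 should also be fairly efficient, and it is much easier to write & extend; as an added bonus, this query will let you know if no foods exist in a given category.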
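
A small sketch of the zero-match behaviour in Option 2, run against SQLite through Python's sqlite3 module (an assumption, chosen only to keep the example self-contained). The 'has bread' category is hypothetical and added to show what happens when nothing matches: with a LEFT JOIN, counting a column from foods yields 0 for an empty category, whereas COUNT(*) also counts the NULL-extended row and yields 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foods (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO foods (id, name) VALUES (?, ?)",
    [(1, "ginger beer"), (2, "white wine"), (3, "red wine"), (4, "ginger wine")],
)

QUERY = """
WITH categories (category_label, food_part) AS (
    SELECT 'contains ginger', 'ginger'
    UNION ALL
    SELECT 'for wine lovers', 'wine'
    UNION ALL
    SELECT 'has bread', 'bread'        -- hypothetical category with no matches
)
SELECT
    categories.category_label,
    COUNT(foods.id) AS matched,        -- ignores the NULL-extended row
    COUNT(*)        AS joined_rows     -- counts the NULL-extended row as well
FROM categories
LEFT JOIN foods ON foods.name LIKE ('%' || categories.food_part || '%')
GROUP BY categories.category_label
"""

# Map each label to (matched, joined_rows) so the two counts can be compared.
counts = {label: (matched, joined) for label, matched, joined in conn.execute(QUERY)}
for label, (matched, joined) in sorted(counts.items()):
    print(f"{label:<17} matched={matched} joined_rows={joined}")
```

The two counts agree for every category that has at least one match and differ only for the empty one.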

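Option 3: Reshape & redistribute your data to better match the grouping keys.

If query execution time is very important, you can also pre-process the dataset. How much this helps depends on your data volume and data distribution. Do you only have a few fixed categories, or will they be searched dynamically from some kind of interface?

For example:

If the dataset were reshaped like this: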

content   name 
--------  ----
ginger    01
ginger    04
beer      01
white     02
wine      02 
wine      04
wine      03

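Then you could shard & distribute on content, and each instance could execute its part of the aggregation in parallel.

An equivalent query here might look like this: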

WITH content_count AS (
  SELECT content, COUNT(*) total
  FROM reshaped_food_table 
  GROUP BY 1
)
SELECT
    CASE content 
      WHEN 'ginger' THEN 'contains ginger'
      WHEN 'wine' THEN 'for wine lovers'
      ELSE 'other' 
    END category
  , total
FROM content_count
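
To make Option 3 concrete, here is a rough sketch that builds the reshaped table by splitting each name on whitespace and then runs the query above, again using Python's sqlite3 module as a stand-in database (an assumption). Splitting into whole words is a simplification: it matches whole words only, which is not the same as a LIKE '%...%' substring match.

```python
import sqlite3

foods = [(1, "ginger beer"), (2, "white wine"), (3, "red wine"), (4, "ginger wine")]

conn = sqlite3.connect(":memory:")
# Same column names as the reshaped table above: content holds one word,
# name holds the id of the food that word came from.
conn.execute("CREATE TABLE reshaped_food_table (content TEXT, name INTEGER)")
conn.executemany(
    "INSERT INTO reshaped_food_table (content, name) VALUES (?, ?)",
    # set() keeps a word that is repeated within one name from being counted twice
    [(word, food_id) for food_id, food_name in foods for word in set(food_name.split())],
)

QUERY = """
WITH content_count AS (
  SELECT content, COUNT(*) AS total
  FROM reshaped_food_table
  GROUP BY content
)
SELECT
    CASE content
      WHEN 'ginger' THEN 'contains ginger'
      WHEN 'wine' THEN 'for wine lovers'
      ELSE 'other'
    END AS category
  , total
FROM content_count
"""

result = sorted(conn.execute(QUERY).fetchall())
for category, total in result:
    print(f"{category:<17} {total}")
```

Because the grouping happens on content before the CASE is applied, every unlabelled word comes back as its own 'other' row; add a WHERE content IN (...) filter or a second aggregation if only the labelled categories are wanted.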

Answer 3: (score: 1)

I don't know Redshift, but in Postgres I would start with something like this:

WITH foods (id, name) AS (VALUES 
  (1, 'ginger beer'), (2, 'white wine'), (3, 'red wine'), (4, 'ginger wine'))
SELECT hardcoded.comment, count(*)
FROM (VALUES ('%ginger%', 'contains ginger'), ('%wine%', 'for wine lovers'))
  AS hardcoded (pattern, comment)
JOIN foods ON foods.name LIKE hardcoded.pattern
GROUP BY hardcoded.comment;

┌─────────────────┬───────┐
│     comment     │ count │
├─────────────────┼───────┤
│ contains ginger │     2 │
│ for wine lovers │     3 │
└─────────────────┴───────┘
(2 rows)

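If that is fine, I would then move on to creating appropriate indexes on foods.name. That might include indexes on name and reverse(name); or perhaps (name gist_trgm_ops), but I don't expect Redshift to provide pg_trgm.

Answer 4: (score: 1)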

Redshift is rather limited in comparison to modern Postgres.
No unnest(), no array_agg(), no ARRAY constructor, no VALUES expression, no LATERAL joins, no tablefunc module. All the tools that would make this nice and simple. At least we have CTEs ...

This should work fine and be relatively easy to extend:

WITH ct AS (
   SELECT a.arr
        , count(name ~ arr[1] OR NULL) AS ct1
        , count(name ~ arr[2] OR NULL) AS ct2
        , count(name ~ arr[3] OR NULL) AS ct3
     -- , ... more
   FROM   foods
   CROSS  JOIN (SELECT '{ginger, wine, bread}'::text[]) AS a(arr)
   GROUP  BY a.arr
   )
SELECT arr[1] AS comment, ct1 AS total FROM ct
UNION ALL SELECT arr[2], ct2 FROM ct
UNION ALL SELECT arr[3], ct3 FROM ct
--  ... more

I use the Posix operator ~ to replace LIKE, just because it is shorter and there is no need for the added placeholder %. Performance is about the same for this simple form in Postgres, not sure about Redshift.

count(boolean_expression OR NULL) should be a bit faster than sum((boolean_expression)::int).
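
For readers unfamiliar with the count(boolean_expression OR NULL) idiom: a true condition stays true, a false one becomes NULL, and count() skips NULLs, so only matching rows are counted. A quick sketch checking it against a summed CASE expression, using Python's sqlite3 module as a stand-in database (an assumption) and LIKE instead of ~ because SQLite has no Posix regex operator built in; it only checks that the results agree and says nothing about relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foods (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO foods (id, name) VALUES (?, ?)",
    [(1, "ginger beer"), (2, "white wine"), (3, "red wine"), (4, "ginger wine")],
)

# true OR NULL  -> true  (counted)
# false OR NULL -> NULL  (skipped by COUNT)
QUERY = """
SELECT
  COUNT(name LIKE '%ginger%' OR NULL)                   AS ginger_count,
  SUM(CASE WHEN name LIKE '%ginger%' THEN 1 ELSE 0 END) AS ginger_sum,
  COUNT(name LIKE '%bread%' OR NULL)                    AS bread_count,
  SUM(CASE WHEN name LIKE '%bread%' THEN 1 ELSE 0 END)  AS bread_sum
FROM foods
"""

ginger_count, ginger_sum, bread_count, bread_sum = conn.execute(QUERY).fetchone()
print(ginger_count, ginger_sum, bread_count, bread_sum)
```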

An index cannot improve the performance of a single sequential scan over the whole table.

Answer 5: (score: 0)

A little searching suggests that you could use your second approach for efficiency, put the result into a CTE, and then unnest() it, as described in: unpivot and PostgreSQL

Answer 6: (score: 0)

Try this -

SELECT 'contains ginger' AS comment
      , Count(*) AS total
FROM foods
WHERE name LIKE '%ginger%'
UNION ALL
SELECT 'for wine lovers'
      , count(*)
FROM foods
WHERE name LIKE '%wine%'

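Answer 7: (score: 0)

From your example it looks like your product names contain at most 2 words. It is more efficient to split on spaces and check whether the individual chunks match than to use like, and then reshape manually as described in the other answers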

WITH counts as (
    SELECT 
      sum(('ginger' in (split_part(name,' ',1),split_part(name,' ',2)))::INT) AS contains_ginger,
      sum(('wine' in (split_part(name,' ',1),split_part(name,' ',2)))::INT) AS for_wine_lovers
    FROM foods
)
-- manually reshape

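Answer 8: (score: 0)

Have you considered using a cursor?

Here is an example I wrote for SQL Server.

You just need some table containing all the values you want to search for in the foods table (in the example below I called it SearchWordTable, with a column named SearchWord)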

CREATE TABLE #TemporaryTable 
(
    KeyWord nvarchar(50),
    ResultCount int
);

DECLARE @searchWord nvarchar(50)
DECLARE @count INT

DECLARE statistic_cursor CURSOR FOR   
SELECT SearchWord
FROM SearchWordTable

OPEN statistic_cursor  
FETCH NEXT FROM statistic_cursor INTO @searchWord  

WHILE @@FETCH_STATUS = 0  
BEGIN  
    SELECT @count = COUNT(1) FROM foods
    WHERE name LIKE '%'+@searchWord+'%'

    INSERT INTO #TemporaryTable (KeyWord, ResultCount) VALUES (@searchWord, @count)

    FETCH NEXT FROM statistic_cursor INTO @searchWord  
END  

CLOSE statistic_cursor  
DEALLOCATE statistic_cursor

SELECT * FROM #TemporaryTable

DROP TABLE #TemporaryTable

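Answer 9: (score: 0)

I think your best option is to split the ingredient list into parts and then count them.

"Pass0".."Pass3" and "numbers" are just a tally table that produces the numbers 1..256, to emulate unnest.

"comments" is a simple table that you should have somewhere, holding the ingredients and their comments

Use your table "foods" instead of mine ;)

Let's take a look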

with
Pass0 as (select '1' as C union all select '1'), --2 rows
Pass1 as (select '1' as C from Pass0 as A, Pass0 as B),--4 rows
Pass2 as (select '1' as C from Pass1 as A, Pass1 as B),--16 rows
Pass3 as (select '1' as C from Pass2 as A, Pass2 as B),--256 rows
numbers as (
    select ROW_NUMBER() OVER(ORDER BY C) AS N FROM Pass3
),    
comments as (
    select 'ginger' ingredient, 'contains ginger' comment union all 
    select 'wine', 'for wine lovers' union all 
    select 'ale', 'a warm kind of beer' union all 
    select 'beer', 'your old friend'
),
foods as (
    select 01 id, 'ginger beer' name union all 
    select 02   ,'white wine' union all 
    select 03   ,'red wine' union all 
    select 04   ,'ginger wine' union all 
    select 05   ,'ginger ale' union all 
    select 06   ,'pale ale' union all 
    select 07   ,'ginger beer'
),
ingredients as (
    select ingredient, COUNT(*) n
    from foods d
    CROSS JOIN LATERAL(
        select SPLIT_PART(d.name, ' ', n.n) ingredient
        from numbers n
        where SPLIT_PART(d.name, ' ', n.n)<>''
    ) ingredients
    group by ingredient
)
select i.*, coalesce(c.comment, 'no comment..') comment
from ingredients i
left join comments c on c.ingredient = i.ingredient

ingredient  n   comment
ale         2   a warm kind of beer
beer        2   your old friend
ginger      4   contains ginger
pale        1   no comment..
red         1   no comment..
white       1   no comment..
wine        3   for wine lovers

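Answer 10: (score: 0)

Here you go.

The WHERE filter reduces the number of rows going into the GROUP BY aggregation. It is not necessary for smaller data, but it will help if the table has billions of rows. Add additional patterns to both the REGEXP filter and the CASE statement.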

SELECT CASE WHEN name LIKE '%ginger%' THEN 'contains ginger' 
            WHEN name LIKE '%wine%'   THEN 'for wine lovers'
       ELSE NULL END "comment"
      ,COUNT(*) total
FROM grouping_test
WHERE REGEXP_INSTR(name,'ginger|wine') > 0
GROUP BY 1
;
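
One thing worth checking before adopting this shape: a CASE expression returns its first matching branch, so a name that matches several patterns lands in only one group. A small sketch on the question's sample data, using Python's sqlite3 module as a stand-in database (an assumption) and a plain LIKE ... OR ... filter in place of REGEXP_INSTR, which SQLite does not provide:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foods (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO foods (id, name) VALUES (?, ?)",
    [(1, "ginger beer"), (2, "white wine"), (3, "red wine"), (4, "ginger wine")],
)

# 'ginger wine' matches both patterns, but CASE stops at the first branch,
# so it is grouped under 'contains ginger' only.
QUERY = """
SELECT CASE WHEN name LIKE '%ginger%' THEN 'contains ginger'
            WHEN name LIKE '%wine%'   THEN 'for wine lovers'
       END AS comment,
       COUNT(*) AS total
FROM foods
WHERE name LIKE '%ginger%' OR name LIKE '%wine%'
GROUP BY comment
"""

totals = dict(conn.execute(QUERY).fetchall())
for comment, total in sorted(totals.items()):
    print(f"{comment:<17} {total}")
```

On this data the wine group comes out as 2 rather than the 3 the question asks for, so the approach fits only when the patterns are mutually exclusive or when first-match semantics are actually wanted.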

Answer 11: (score: 0)

Try this SQL:

SELECT count(1) as total,'contains ginger' result
FROM foods where name LIKE '%ginger%' 
union all
SELECT count(1),'for wine lovers' 
FROM foods where name LIKE '%wine%'