Postgres: group by and extract the groups that have more than one element

Time: 2014-07-18 17:56:20

Tags: sql postgresql

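I want to "group by" two unicode fields (keyword_text and keyword_match_type) and extract all columns and all rows of the groups that have more than one element.

For example, one row is: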

column                  | value
------------------------+------------------------------------------------------------
keyword_text            | "lebanese home delivery jai"
keyword_norm            | "lebanese home delivery jai"
keyword_GAD_id          | 61557127036
keyword_account         | "IN [S_02] Cuisine"
keyword_MCC_id          | 7795189055
keyword_campaign        | "IN-JAI[S[Cui_30_EN]: Lebanese"
keyword_campaign_GAD_id | 301573516
keyword_ad_group        | "IN-JAI[S[Cui_30_EN|del_02|geo_01]_ex: (Lebanese) Lebanese home delivery Jaipur"
keyword_ad_group_GAD_id | 11043049036
keyword_destination_url | http://www.bla.in/restaurants/index/cuisines/lebanese/city/jaipur
keyword_max_cpc         | 480000
keyword_status          | ENABLED
keyword_match_type      | EXACT
keyword_campaign_status | PAUSED
keyword_ad_group_status | PAUSED
db_id                   | 1
created_at              | "2014-07-18 18:42:43"

The table was created with:

CREATE TABLE adword_keywords
(
  keyword_text character varying(1000) NOT NULL,
  keyword_norm character varying(1000) NOT NULL,
  "keyword_GAD_id" bigint NOT NULL,
  keyword_account character varying NOT NULL,
  "keyword_MCC_id" bigint NOT NULL,
  keyword_campaign character varying NOT NULL,
  "keyword_campaign_GAD_id" bigint NOT NULL,
  keyword_ad_group character varying NOT NULL,
  "keyword_ad_group_GAD_id" bigint NOT NULL,
  keyword_destination_url character varying NOT NULL,
  keyword_max_cpc double precision,
  keyword_status keyword_status,
  keyword_match_type match_type,
  keyword_campaign_status keyword_c_status,
  keyword_ad_group_status keyword_ag_status,
  db_id bigserial NOT NULL,
  created_at timestamp without time zone,
  CONSTRAINT adword_keywords_pkey PRIMARY KEY (db_id)
)
WITH (
  OIDS=FALSE
);

CREATE INDEX ix_adword_keywords_keyword_norm
  ON adword_keywords
  USING btree
  (keyword_norm COLLATE pg_catalog."default");

I tried the following query:

SELECT adword_keywords.*
FROM adword_keywords 
    JOIN (
        SELECT adword_keywords.keyword_text AS keyword_text,adword_keywords.keyword_match_type AS keyword_match_type 
        FROM adword_keywords GROUP BY adword_keywords.keyword_text, adword_keywords.keyword_match_type 
        HAVING count(adword_keywords.db_id) > 1) AS anon_1 
    ON adword_keywords.keyword_text = anon_1.keyword_text AND adword_keywords.keyword_match_type = anon_1.keyword_match_type 
WHERE adword_keywords.keyword_campaign_status = 'ENABLED' AND adword_keywords.keyword_ad_group_status = 'ENABLED' AND adword_keywords.keyword_status = 'ENABLED'

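Unfortunately, this returns wrong results: it also returns groups that consist of a single element (when grouping by ['keyword_text','match_type'])!

Does anyone know what is wrong with this query?

Note that if I pull all the data from the database with the following query and load it into a pandas data structure: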

SELECT * FROM adword_keywords
WHERE adword_keywords.keyword_campaign_status = 'ENABLED'
AND adword_keywords.keyword_ad_group_status = 'ENABLED'
AND adword_keywords.keyword_status = 'ENABLED'

I can filter the groups I want:

df.groupby(['keyword_text','match_type']).filter(lambda x: x.shape[0]>1)

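The latter procedure returns the correct results.

However, for performance and memory reasons I would like to do the same thing in the SQL query (the dataset is large and cannot be loaded into RAM in full).

Edit

Based on ypercube's answer I now have three alternative queries that return the correct results. I collected them here together with their running times for reference: the first version is the fastest.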

Using EXISTS: 1 loops, best of 3: 2.22 s per loop

WITH cte AS
  ( SELECT * 
    FROM adword_keywords  
    WHERE keyword_campaign_status = 'ENABLED' 
      AND keyword_ad_group_status = 'ENABLED' 
      AND keyword_status = 'ENABLED'
  )
SELECT a.*
FROM cte AS a
WHERE EXISTS
      ( SELECT *
        FROM cte AS b
        WHERE (b.keyword_text, b.keyword_match_type) 
            = (a.keyword_text, a.keyword_match_type)
          AND b.db_id <> a.db_id
      ) ;

Using PARTITION: 1 loops, best of 3: 5.7 s per loop

WITH cte AS
  ( SELECT *,
           COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt 
    FROM adword_keywords  
    WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
        = ('ENABLED', 'ENABLED', 'ENABLED')
  )
SELECT *
FROM cte
WHERE cnt >= 2 ;

Using GROUP BY: 1 loops, best of 3: 5.11 s per loop

select ak.*
from
    adword_keywords ak
    inner join (
        select keyword_text, keyword_match_type
        from adword_keywords
        where
            keyword_campaign_status = 'ENABLED' AND
            keyword_ad_group_status = 'ENABLED' AND
            keyword_status = 'ENABLED'
        group by keyword_text, keyword_match_type
        having count(db_id) > 1
    ) an1 using (keyword_text, keyword_match_type)
    where
            keyword_campaign_status = 'ENABLED' AND
            keyword_ad_group_status = 'ENABLED' AND
            keyword_status = 'ENABLED'

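3 answers:

Answer 0 (score: 1)

You can use EXISTS for this kind of query, so there is no COUNT at all(!), only a check that at least one other ENABLED row exists with the same keyword_text and keyword_match_type. The check on the primary key ensures that it really is a different row: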

WITH cte AS
  ( SELECT * 
    FROM adword_keywords  
    WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
        = ('ENABLED', 'ENABLED', 'ENABLED')
  )
SELECT a.*
FROM cte AS a
WHERE EXISTS
      ( SELECT *
        FROM cte AS b
        WHERE (b.keyword_text, b.keyword_match_type) 
            = (a.keyword_text, a.keyword_match_type)
          AND b.db_id <> a.db_id
      ) ;
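
To see the EXISTS check at work on a few rows, here is a minimal sketch. It is not the asker's setup: it assumes an in-memory SQLite database in place of Postgres, a table cut down to one status column, and made-up rows.

```python
import sqlite3

# Assumptions: SQLite stands in for Postgres, the table keeps only the
# columns the query needs, and the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adword_keywords ("
    " db_id INTEGER PRIMARY KEY,"
    " keyword_text TEXT, keyword_match_type TEXT, keyword_status TEXT)"
)
conn.executemany(
    "INSERT INTO adword_keywords VALUES (?, ?, ?, ?)",
    [
        (1, "pizza", "EXACT", "ENABLED"),
        (2, "pizza", "EXACT", "ENABLED"),  # enabled duplicate of row 1
        (3, "sushi", "EXACT", "ENABLED"),
        (4, "sushi", "EXACT", "PAUSED"),   # duplicate of row 3, not enabled
        (5, "sushi", "BROAD", "ENABLED"),  # same text, other match type
    ],
)

exists_query = """
WITH cte AS (
    SELECT * FROM adword_keywords WHERE keyword_status = 'ENABLED'
)
SELECT a.db_id
FROM cte AS a
WHERE EXISTS (
    SELECT 1
    FROM cte AS b
    WHERE b.keyword_text = a.keyword_text
      AND b.keyword_match_type = a.keyword_match_type
      AND b.db_id <> a.db_id   -- must be a different row
)
ORDER BY a.db_id
"""
exists_ids = [row[0] for row in conn.execute(exists_query)]
print(exists_ids)  # [1, 2]
```

Row 3 is left out because its only partner (row 4) is filtered away inside the CTE before the EXISTS check runs.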

Or with a window function:

WITH cte AS
  ( SELECT *,
           COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt 
    FROM adword_keywords  
    WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
        = ('ENABLED', 'ENABLED', 'ENABLED')
  )
SELECT *
FROM cte
WHERE cnt > 1 ;
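
The window-function variant can be sketched the same way. The example below assumes SQLite 3.25 or newer (needed for window functions) in place of Postgres, a simplified table, and invented rows. It first prints the per-row count, then applies the `cnt > 1` filter.

```python
import sqlite3

# Assumptions: in-memory SQLite (3.25+) instead of Postgres, a reduced
# table with one status column, and made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adword_keywords ("
    " db_id INTEGER PRIMARY KEY,"
    " keyword_text TEXT, keyword_match_type TEXT, keyword_status TEXT)"
)
conn.executemany(
    "INSERT INTO adword_keywords VALUES (?, ?, ?, ?)",
    [
        (1, "pizza", "EXACT", "ENABLED"),
        (2, "pizza", "EXACT", "ENABLED"),
        (3, "sushi", "EXACT", "ENABLED"),
        (4, "sushi", "EXACT", "PAUSED"),
        (5, "sushi", "BROAD", "ENABLED"),
    ],
)

counted = """
SELECT db_id,
       COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt
FROM adword_keywords
WHERE keyword_status = 'ENABLED'
"""

# Every enabled row keeps its identity and gains the size of its group.
counts = conn.execute(counted + " ORDER BY db_id").fetchall()
print(counts)  # [(1, 2), (2, 2), (3, 1), (5, 1)]

# Wrapping the same SELECT in a CTE lets the outer query filter on cnt.
window_ids = [
    row[0]
    for row in conn.execute(
        "WITH cte AS (" + counted + ") "
        "SELECT db_id FROM cte WHERE cnt > 1 ORDER BY db_id"
    )
]
print(window_ids)  # [1, 2]
```

The filter has to sit outside the CTE because a window function result cannot be referenced in the WHERE clause of the SELECT that computes it.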

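Your query does not work correctly because you only have the ENABLED conditions at the outer level. The derived table therefore counts rows of every status, so a group with one ENABLED row and one non-enabled row passes the HAVING test and is then reduced to a single row by the outer filter. Adding the conditions to the internal (derived table) query as well should give the same result as the two queries above: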

SELECT ak.*
FROM
    adword_keywords ak
  JOIN
    ( SELECT keyword_text, keyword_match_type
      FROM adword_keywords
      WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
            = ('ENABLED', 'ENABLED', 'ENABLED')
      GROUP BY keyword_text, keyword_match_type
      HAVING COUNT(*) > 1
    ) AS d
    USING (keyword_text, keyword_match_type) 
WHERE (ak.keyword_campaign_status, ak.keyword_ad_group_status, ak.keyword_status)
    = ('ENABLED', 'ENABLED', 'ENABLED');
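
The difference between the two placements of the filter can be reproduced on a handful of rows. This is a sketch under stated assumptions: in-memory SQLite instead of Postgres, a single status column, made-up data, and a hypothetical helper `duplicate_ids` that only toggles the filter inside the derived table.

```python
import sqlite3

# Assumptions: SQLite stands in for Postgres; table and rows are simplified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adword_keywords ("
    " db_id INTEGER PRIMARY KEY,"
    " keyword_text TEXT, keyword_match_type TEXT, keyword_status TEXT)"
)
conn.executemany(
    "INSERT INTO adword_keywords VALUES (?, ?, ?, ?)",
    [
        (1, "pizza", "EXACT", "ENABLED"),
        (2, "pizza", "EXACT", "ENABLED"),
        (3, "sushi", "EXACT", "ENABLED"),
        (4, "sushi", "EXACT", "PAUSED"),  # inflates the sushi/EXACT count
    ],
)


def duplicate_ids(filter_inside_subquery):
    """Hypothetical helper: run the join with or without the inner filter."""
    inner_where = (
        "WHERE keyword_status = 'ENABLED'" if filter_inside_subquery else ""
    )
    query = f"""
        SELECT ak.db_id
        FROM adword_keywords AS ak
        JOIN (
            SELECT keyword_text, keyword_match_type
            FROM adword_keywords
            {inner_where}
            GROUP BY keyword_text, keyword_match_type
            HAVING COUNT(*) > 1
        ) AS d USING (keyword_text, keyword_match_type)
        WHERE ak.keyword_status = 'ENABLED'
        ORDER BY ak.db_id
    """
    return [row[0] for row in conn.execute(query)]


original_result = duplicate_ids(filter_inside_subquery=False)
fixed_result = duplicate_ids(filter_inside_subquery=True)
print(original_result)  # [1, 2, 3]  row 3 is alone among enabled rows
print(fixed_result)     # [1, 2]
```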

Answer 1 (score: 0)

Is what you want to put the filter inside the counting subquery?

select ak.*
from
    adword_keywords ak
    inner join (
        select keyword_text, keyword_match_type
        from adword_keywords
        where
            keyword_campaign_status = 'ENABLED' AND
            keyword_ad_group_status = 'ENABLED' AND
            keyword_status = 'ENABLED'
        group by keyword_text, keyword_match_type
        having count(*) > 1
    ) an1 using (keyword_text, keyword_match_type)

Without sample data and the desired result, this is just a guess.
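
One point to check with a query of this shape: the outer query has no status filter, so a row that is not ENABLED is still returned when it shares (keyword_text, keyword_match_type) with a qualifying group. Whether that is wanted depends on the intent. The sketch below illustrates it with in-memory SQLite, a single status column, and invented rows, all of which are assumptions.

```python
import sqlite3

# Assumptions: SQLite in place of Postgres, reduced table, made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adword_keywords ("
    " db_id INTEGER PRIMARY KEY,"
    " keyword_text TEXT, keyword_match_type TEXT, keyword_status TEXT)"
)
conn.executemany(
    "INSERT INTO adword_keywords VALUES (?, ?, ?, ?)",
    [
        (1, "pizza", "EXACT", "ENABLED"),
        (2, "pizza", "EXACT", "ENABLED"),
        (3, "pizza", "EXACT", "PAUSED"),   # same group, not enabled
        (4, "sushi", "EXACT", "ENABLED"),  # no duplicate at all
    ],
)

base_query = """
SELECT ak.db_id, ak.keyword_status
FROM adword_keywords AS ak
JOIN (
    SELECT keyword_text, keyword_match_type
    FROM adword_keywords
    WHERE keyword_status = 'ENABLED'
    GROUP BY keyword_text, keyword_match_type
    HAVING COUNT(*) > 1
) AS an1 USING (keyword_text, keyword_match_type)
"""

# Filter only inside the subquery: the PAUSED row 3 comes back as well.
inner_only = conn.execute(base_query + " ORDER BY ak.db_id").fetchall()
print(inner_only)

# Repeating the filter in the outer query keeps only enabled rows.
both_levels = conn.execute(
    base_query + " WHERE ak.keyword_status = 'ENABLED' ORDER BY ak.db_id"
).fetchall()
print(both_levels)
```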

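Answer 2 (score: 0)

When you GROUP BY some fields, you are doing two important things:

  1. You are saying that you want one row for each distinct combination of those fields.
  2. You are saying that every other field you use will be aggregated.

That "other" can be a problem. If you group by a field, you cannot meaningfully aggregate that same field the way you would like.

What you can do is count some other field, such as your primary key. (You could also use COUNT(*), I think; in fact, you would have to if you had no unique field.)

For example, your query might look like: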

    SELECT *
    FROM adword_keywords
        JOIN (
            SELECT keyword_text, keyword_match_type
            FROM adword_keywords
            GROUP BY keyword_text, keyword_match_type
            HAVING count(db_id) > 1
        ) AS duplicated USING (keyword_text, keyword_match_type)
    WHERE keyword_campaign_status = 'ENABLED'
      AND keyword_ad_group_status = 'ENABLED'
      AND keyword_status = 'ENABLED'
    

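This assumes you want all records with an "ENABLED" status that have at least one duplicate, whether or not the duplicate is enabled. If you only want records whose duplicates are enabled too, you need to add those conditions to the subquery as well. Keep them in the outer query in that case: the join matches only on (keyword_text, keyword_match_type), so without the outer filter the non-enabled rows of a qualifying group would still be returned.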

For future reference: if you want to prevent duplicates entirely (they are often a mistake), you may want to consider adding a unique key on (keyword_text, keyword_match_type).
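
A minimal sketch of how such a unique key behaves, assuming in-memory SQLite in place of Postgres, a reduced table, and a hypothetical index name `uq_keyword_text_match_type`. In Postgres the equivalent would be a UNIQUE constraint or unique index on the same two columns.

```python
import sqlite3

# Assumptions: SQLite stands in for Postgres; the index name is made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adword_keywords ("
    " db_id INTEGER PRIMARY KEY,"
    " keyword_text TEXT NOT NULL,"
    " keyword_match_type TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX uq_keyword_text_match_type"
    " ON adword_keywords (keyword_text, keyword_match_type)"
)

insert = (
    "INSERT INTO adword_keywords (keyword_text, keyword_match_type)"
    " VALUES (?, ?)"
)
conn.execute(insert, ("pizza", "EXACT"))
conn.execute(insert, ("pizza", "BROAD"))  # other match type: allowed

try:
    conn.execute(insert, ("pizza", "EXACT"))  # same pair again
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM adword_keywords").fetchone()[0]
print(duplicate_rejected, row_count)  # True 2
```

Note that keyword_match_type is nullable in the original table, and a unique index treats NULL values as distinct from each other by default, so rows with a NULL match type would not be caught by it.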