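I want to "group by" two unicode fields (keyword_text and keyword_match_type) and extract all columns and all rows for the groups that have more than one element.
For example, one row is: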
keyword_text | keyword_norm | keyword_GAD_id| keyword_account | keyword_MCC_id | keyword_campaign | keyword_campaign_GAD_id | keyword_ad_group | keyword_ad_group_GAD_id| keyword_destination_url | keyword_max_cpc | keyword_status | keyword_match_type | keyword_campaign_status | keyword_ad_group_status | db_id | created_at |
________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
"lebanese home delivery jai", "lebanese home delivery jai", 61557127036, "IN [S_02] Cuisine", 7795189055, "IN-JAI[S[Cui_30_EN]: Lebanese", 301573516, "IN-JAI[S[Cui_30_EN|del_02|geo_01]_ex: (Lebanese) Lebanese home delivery Jaipur", 11043049036, http://www.bla.in/restaurants/index/cuisines/lebanese/city/jaipur, 480000, ENABLED, EXACT, PAUSED, PAUSED, 1, "2014-07-18 18:42:43"
The table was created with:
CREATE TABLE adword_keywords
(
keyword_text character varying(1000) NOT NULL,
keyword_norm character varying(1000) NOT NULL,
"keyword_GAD_id" bigint NOT NULL,
keyword_account character varying NOT NULL,
"keyword_MCC_id" bigint NOT NULL,
keyword_campaign character varying NOT NULL,
"keyword_campaign_GAD_id" bigint NOT NULL,
keyword_ad_group character varying NOT NULL,
"keyword_ad_group_GAD_id" bigint NOT NULL,
keyword_destination_url character varying NOT NULL,
keyword_max_cpc double precision,
keyword_status keyword_status,
keyword_match_type match_type,
keyword_campaign_status keyword_c_status,
keyword_ad_group_status keyword_ag_status,
db_id bigserial NOT NULL,
created_at timestamp without time zone,
CONSTRAINT adword_keywords_pkey PRIMARY KEY (db_id)
)
WITH (
OIDS=FALSE
);
CREATE INDEX ix_adword_keywords_keyword_norm
ON adword_keywords
USING btree
(keyword_norm COLLATE pg_catalog."default");
I tried the following query:
SELECT adword_keywords.*
FROM adword_keywords
JOIN (
SELECT adword_keywords.keyword_text AS keyword_text,adword_keywords.keyword_match_type AS keyword_match_type
FROM adword_keywords GROUP BY adword_keywords.keyword_text, adword_keywords.keyword_match_type
HAVING count(adword_keywords.db_id) > 1) AS anon_1
ON adword_keywords.keyword_text = anon_1.keyword_text AND adword_keywords.keyword_match_type = anon_1.keyword_match_type
WHERE adword_keywords.keyword_campaign_status = 'ENABLED' AND adword_keywords.keyword_ad_group_status = 'ENABLED' AND adword_keywords.keyword_status = 'ENABLED'
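Unfortunately this returns wrong results: it also returns groups that consist of a single element (when grouping by ['keyword_text','match_type'])!
Does anyone know what is wrong with this query?
Note that if I pull all the data from the database and load it into a pandas data structure with the following query: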
SELECT * FROM adword_keywords
WHERE adword_keywords.keyword_campaign_status = 'ENABLED'
AND adword_keywords.keyword_ad_group_status = 'ENABLED'
AND adword_keywords.keyword_status = 'ENABLED'
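I can then filter the groups I want:
df.groupby(['keyword_text','match_type']).filter(lambda x: x.shape[0]>1)
The latter procedure returns the correct results.
However, for performance and memory reasons I would like to do the same thing with an SQL query (the dataset is large and cannot be fully loaded into RAM).
Based on ypercube's answer I have three alternative queries that return the correct results. I collected their run times for reference: the first version is the fastest.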
Using EXISTS (1 loops, best of 3: 2.22 s per loop):
WITH cte AS
( SELECT *
FROM adword_keywords
WHERE keyword_campaign_status = 'ENABLED'
AND keyword_ad_group_status = 'ENABLED'
AND keyword_status = 'ENABLED'
)
SELECT a.*
FROM cte AS a
WHERE EXISTS
( SELECT *
FROM cte AS b
WHERE (b.keyword_text, b.keyword_match_type)
= (a.keyword_text, a.keyword_match_type)
AND b.db_id <> a.db_id
) ;
Using PARTITION (1 loops, best of 3: 5.7 s per loop):
WITH cte AS
( SELECT *,
COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt
FROM adword_keywords
WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED')
)
SELECT *
FROM cte
WHERE cnt >= 2 ;
Using GROUP BY (1 loops, best of 3: 5.11 s per loop):
select ak.*
from
adword_keywords ak
inner join (
select keyword_text, keyword_match_type
from adword_keywords
where
keyword_campaign_status = 'ENABLED' AND
keyword_ad_group_status = 'ENABLED' AND
keyword_status = 'ENABLED'
group by keyword_text, keyword_match_type
having count(db_id) > 1
) an1 using (keyword_text, keyword_match_type)
where
keyword_campaign_status = 'ENABLED' AND
keyword_ad_group_status = 'ENABLED' AND
keyword_status = 'ENABLED'
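Answer 0 (score: 1)
You can use EXISTS for this kind of query, so no COUNT at all (!), just a check that at least one other row exists with the same keyword_text and keyword_match_type. The check on the primary key makes sure it is a different row: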
WITH cte AS
( SELECT *
FROM adword_keywords
WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED')
)
SELECT a.*
FROM cte AS a
WHERE EXISTS
( SELECT *
FROM cte AS b
WHERE (b.keyword_text, b.keyword_match_type)
= (a.keyword_text, a.keyword_match_type)
AND b.db_id <> a.db_id
) ;
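To see the EXISTS approach on concrete rows, here is a minimal sketch. It is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, a hypothetical cut-down table `kw` with a single `status` column in place of the three status columns, and made-up rows. The row-value comparison is spelled out as two equalities so it also runs on older SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kw (db_id INTEGER PRIMARY KEY, keyword_text TEXT, "
    "keyword_match_type TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO kw (keyword_text, keyword_match_type, status) VALUES (?, ?, ?)",
    [
        ("pizza", "EXACT", "ENABLED"),  # db_id 1: only enabled row of its group
        ("pizza", "EXACT", "PAUSED"),   # db_id 2
        ("sushi", "EXACT", "ENABLED"),  # db_id 3
        ("sushi", "EXACT", "ENABLED"),  # db_id 4
        ("sushi", "BROAD", "ENABLED"),  # db_id 5: different match type
    ],
)

exists_ids = [
    row[0]
    for row in conn.execute("""
        WITH cte AS (SELECT * FROM kw WHERE status = 'ENABLED')
        SELECT a.db_id
        FROM cte AS a
        WHERE EXISTS (
            SELECT 1
            FROM cte AS b
            WHERE b.keyword_text = a.keyword_text
              AND b.keyword_match_type = a.keyword_match_type
              AND b.db_id <> a.db_id   -- must be a different row
        )
        ORDER BY a.db_id
    """)
]
print(exists_ids)
```

Only the two enabled "sushi"/EXACT rows have an enabled partner, so those are the rows that come back.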
Or with a window function:
WITH cte AS
( SELECT *,
COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt
FROM adword_keywords
WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED')
)
SELECT *
FROM cte
WHERE cnt > 1 ;
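The window-function variant can be tried the same way. Again this is only a sketch under assumptions: sqlite3 stands in for PostgreSQL (window functions need SQLite 3.25 or newer), and the table `kw`, its single `status` column and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kw (db_id INTEGER PRIMARY KEY, keyword_text TEXT, "
    "keyword_match_type TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO kw (keyword_text, keyword_match_type, status) VALUES (?, ?, ?)",
    [
        ("pizza", "EXACT", "ENABLED"),  # db_id 1
        ("pizza", "EXACT", "PAUSED"),   # db_id 2
        ("sushi", "EXACT", "ENABLED"),  # db_id 3
        ("sushi", "EXACT", "ENABLED"),  # db_id 4
    ],
)

# The WHERE filter runs before the window is evaluated, so cnt counts
# only enabled rows within each (keyword_text, keyword_match_type) group.
windowed = conn.execute("""
    WITH cte AS (
        SELECT db_id,
               COUNT(*) OVER (PARTITION BY keyword_text, keyword_match_type) AS cnt
        FROM kw
        WHERE status = 'ENABLED'
    )
    SELECT db_id, cnt
    FROM cte
    WHERE cnt > 1
    ORDER BY db_id
""").fetchall()
print(windowed)
```

A side effect of this form is that every returned row carries the size of its group in cnt.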
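Your query does not work as intended because the ENABLED conditions appear only at the outer level: the derived table counts every row of a group, enabled or not, so a group can pass the HAVING test while containing a single enabled row. Adding the conditions inside the internal (derived table) query as well should give the same result as the queries above: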
SELECT ak.*
FROM
adword_keywords ak
JOIN
( SELECT keyword_text, keyword_match_type
FROM adword_keywords
WHERE (keyword_campaign_status, keyword_ad_group_status, keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED')
GROUP BY keyword_text, keyword_match_type
HAVING COUNT(*) > 1
) AS d
USING (keyword_text, keyword_match_type)
WHERE (ak.keyword_campaign_status, ak.keyword_ad_group_status, ak.keyword_status)
= ('ENABLED', 'ENABLED', 'ENABLED');
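The difference between filtering only outside and filtering in both places shows up on a handful of rows. The following is a sketch, not the original schema: sqlite3 replaces PostgreSQL, and the table `kw`, its single `status` column, the helper `duplicated_ids` and the data are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kw (db_id INTEGER PRIMARY KEY, keyword_text TEXT, "
    "keyword_match_type TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO kw (keyword_text, keyword_match_type, status) VALUES (?, ?, ?)",
    [
        ("pizza", "EXACT", "ENABLED"),  # db_id 1: only enabled row of its group
        ("pizza", "EXACT", "PAUSED"),   # db_id 2
        ("sushi", "EXACT", "ENABLED"),  # db_id 3
        ("sushi", "EXACT", "ENABLED"),  # db_id 4
    ],
)


def duplicated_ids(filter_inside):
    """Return db_ids of enabled rows whose group has more than one row.

    With filter_inside=False the derived table counts all rows (the
    question's query); with True it counts only enabled rows.
    """
    inner_where = "WHERE status = 'ENABLED'" if filter_inside else ""
    sql = f"""
        SELECT ak.db_id
        FROM kw AS ak
        JOIN (SELECT keyword_text, keyword_match_type
              FROM kw
              {inner_where}
              GROUP BY keyword_text, keyword_match_type
              HAVING COUNT(*) > 1) AS d
          USING (keyword_text, keyword_match_type)
        WHERE ak.status = 'ENABLED'
        ORDER BY ak.db_id
    """
    return [row[0] for row in conn.execute(sql)]


outer_only = duplicated_ids(filter_inside=False)
both_levels = duplicated_ids(filter_inside=True)
print(outer_only, both_levels)
```

With the filter only outside, row 1 is returned even though it is the sole enabled row of its group, because the paused row 2 was counted with it.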
Answer 1 (score: 0)
What you want is to put the filter inside the counting query, isn't it?
select ak.*
from
adword_keywords ak
inner join (
select keyword_text, keyword_match_type
from adword_keywords
where
keyword_campaign_status = 'ENABLED' AND
keyword_ad_group_status = 'ENABLED' AND
keyword_status = 'ENABLED'
group by keyword_text, keyword_match_type
having count(*) > 1
) an1 using (keyword_text, keyword_match_type)
Without sample data and the desired result this is just a guess.
Answer 2 (score: 0)
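When you GROUP BY some fields, you are doing two important things:
That "other" can be a problem. If you group by a field, you cannot meaningfully aggregate that field in the way you want.
What you can do is count some other field, such as your primary key. (You could also use COUNT(*), I think; in fact, you have to if you don't have any unique field.)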
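The practical difference between COUNT(*) and COUNT(some_column) is that the latter skips NULLs, which is why counting a NOT NULL primary key such as db_id gives the same number as COUNT(*). A small sketch, assuming a hypothetical table `t` and using sqlite3 in place of PostgreSQL (both treat NULLs the same way here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, maybe_null INTEGER)")
conn.executemany(
    "INSERT INTO t (grp, maybe_null) VALUES (?, ?)",
    [("a", 1), ("a", None), ("b", None)],
)

# COUNT(*) counts rows; COUNT(maybe_null) counts only non-NULL values.
counts = conn.execute("""
    SELECT grp, COUNT(*), COUNT(maybe_null)
    FROM t
    GROUP BY grp
    ORDER BY grp
""").fetchall()
print(counts)
```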
For example, your query might look like
SELECT *
FROM adword_keywords
JOIN (
SELECT keyword_text, keyword_match_type
FROM adword_keywords
GROUP BY keyword_text, keyword_match_type
HAVING count(db_id) > 1
) AS duplicated USING (keyword_text, keyword_match_type)
WHERE keyword_campaign_status = 'ENABLED'
AND keyword_ad_group_status = 'ENABLED'
AND keyword_status = 'ENABLED'
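This assumes you want to find all records with "ENABLED" status that have at least one duplicate, whether or not the duplicate is enabled. If you only want records whose duplicates are enabled as well, you need to add those conditions to the subquery. (Keep them in the outer query too: the join matches only on keyword_text and keyword_match_type, so it does not by itself eliminate non-enabled rows that share a key with a qualifying group.)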
For future reference: if you want to prevent duplicates altogether (they are usually a mistake), you may want to consider adding a unique key on (keyword_text, keyword_match_type).
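As a rough illustration of what such a key does, the sketch below creates a unique index on the two columns and shows the second identical pair being rejected. It uses sqlite3 and a hypothetical cut-down table `kw`; the index name `uq_kw_text_match` is made up. In PostgreSQL the equivalent would be a UNIQUE constraint or unique index on the same column pair, which can only be created once the existing duplicates have been removed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kw (db_id INTEGER PRIMARY KEY, "
    "keyword_text TEXT NOT NULL, keyword_match_type TEXT NOT NULL)"
)
# Hypothetical index name; enforces one row per (text, match type) pair.
conn.execute(
    "CREATE UNIQUE INDEX uq_kw_text_match "
    "ON kw (keyword_text, keyword_match_type)"
)

insert = "INSERT INTO kw (keyword_text, keyword_match_type) VALUES (?, ?)"
conn.execute(insert, ("pizza", "EXACT"))
conn.execute(insert, ("pizza", "BROAD"))  # same text, other match type: allowed

try:
    conn.execute(insert, ("pizza", "EXACT"))  # exact duplicate of the first row
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM kw").fetchone()[0]
print(rejected, row_count)
```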