I have a database table with these columns:
local  domain  email_sha256  password  password_sha256
a      b       ...           C         ...
a      bb      ...           C         ...
a      bb      ...           CC        ...
a      bbb     ...           C         ...
aa     bb      ...           CCC       ...
aa     bb      ...           CC        ...
The local and domain parts are actually an email address, split at the @ character. For example:
test@gmail.com
local = test
domain = gmail.com
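As a small illustration of that split, here is a minimal Python sketch. The helper name split_email is hypothetical, and splitting at the last @ is an assumption; the question only says the address is split at the @ character.

```python
def split_email(address: str) -> tuple[str, str]:
    """Split an address into (local, domain) at the last '@' (assumed rule)."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        # No '@' found, so there is nothing to split on.
        raise ValueError(f"no '@' in {address!r}")
    return local, domain


parts = split_email("test@gmail.com")
print(parts)
```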
I want to find all rows that share the same local and password pair but have a different domain. Using only the local, domain, and password columns, that would return something like:
local domain password
a b C
a bb C
a bbb C
I have been trying to first identify all duplicated local, password pairs with:
SELECT local, password
FROM tablename
GROUP BY local, password
HAVING count(*) > 1
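To see what this first step returns, the sketch below runs the same GROUP BY on the six sample rows from the question. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for the real table, which is an assumption made only so the query can be run locally; the hashed columns are left out.

```python
import sqlite3

# The six sample rows from the question (local, domain, password).
rows = [
    ("a", "b", "C"),
    ("a", "bb", "C"),
    ("a", "bb", "CC"),
    ("a", "bbb", "C"),
    ("aa", "bb", "CCC"),
    ("aa", "bb", "CC"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (local TEXT, domain TEXT, password TEXT)")
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?)", rows)

# Same query as above: pairs that occur on more than one row.
dup_pairs = conn.execute(
    """
    SELECT local, password
    FROM tablename
    GROUP BY local, password
    HAVING COUNT(*) > 1
    """
).fetchall()
print(dup_pairs)
```

Note that COUNT(*) counts rows, not distinct domains, so a pair that appears twice with the same domain would also pass this filter.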
Then, to get more columns than the ones in the GROUP BY, I JOIN the table to itself:
SELECT local, domain, password
FROM tablename
JOIN (SELECT local, password FROM tablename GROUP BY local, password HAVING count(*) > 1)
USING (local, password)
Now, to make sure the domains differ, I join the table to itself once more and add a WHERE clause. To avoid duplicates I use GROUP BY. This is my final query:
SELECT A.local, A.domain, A.password
FROM tablename as A
JOIN
(SELECT local, domain, password
FROM tablename
JOIN
(SELECT local, password
FROM tablename
GROUP BY local, password
HAVING count(*) > 1)
USING (local, password)) as B
USING (local, password)
WHERE A.password = B.password AND A.domain != B.domain AND A.local = B.local
GROUP BY local, domain, password
ORDER BY local, password
Am I dropping potentially valid results with this query? Also, is there a faster or better query I could run that gives the same result?
Thanks.
Note: the table has no unique ID, but I probably have no duplicate email_sha256, password_sha256 pairs, so those could be used as an ID.
Answer 0: (score: 1)
Below is for BigQuery Standard SQL:
#standardSQL
WITH remove_dup_domains AS (
SELECT rec.* FROM (
SELECT local, domain, password, ANY_VALUE(t) rec
FROM `project.dataset.table` t
GROUP BY local, domain, password
)
)
SELECT y.* FROM (
SELECT ARRAY_AGG(t) bin
FROM remove_dup_domains t
GROUP BY local, password
HAVING COUNT(1) > 1
)x, x.bin y
You can test this with the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'a' local, 'b' domain, 'C' password, 'whatever else1' other_cols UNION ALL
SELECT 'a', 'bb', 'C', 'whatever else2' UNION ALL
SELECT 'a', 'bb', 'CC', 'whatever else3' UNION ALL
SELECT 'a', 'bbb', 'C', 'whatever else4' UNION ALL
SELECT 'a', 'bbbb', 'D', 'whatever else5' UNION ALL
SELECT 'a', 'bbbbb', 'E', 'whatever else6' UNION ALL
SELECT 'aa', 'bb', 'CCC', 'whatever else7' UNION ALL
SELECT 'aa', 'bb', 'CC', 'whatever else8' UNION ALL
SELECT 'aaa', 'com', 'H', 'whatever else9' UNION ALL
SELECT 'aaa', 'com', 'H', 'whatever else10'
), remove_dup_domains AS (
SELECT rec.* FROM (
SELECT local, domain, password, ANY_VALUE(t) rec
FROM `project.dataset.table` t
GROUP BY local, domain, password
)
)
SELECT y.* FROM (
SELECT ARRAY_AGG(t) bin
FROM remove_dup_domains t
GROUP BY local, password
HAVING COUNT(1) > 1
)x, x.bin y
with result:
Row  local  domain  password  other_cols
1    a      b       C         whatever else1
2    a      bb      C         whatever else2
3    a      bbb     C         whatever else4
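For readers without BigQuery access, the same two-step idea (first drop repeated local, domain, password triples, then keep the local, password pairs that still have more than one row) can be sketched in SQLite through Python's sqlite3 module. This is an approximation under stated assumptions, not the answer's exact query: SQLite has no ANY_VALUE or ARRAY_AGG, so SELECT DISTINCT and a join stand in for them, and other_cols is omitted.

```python
import sqlite3

# Test rows from the answer above, without the other_cols column.
rows = [
    ("a", "b", "C"), ("a", "bb", "C"), ("a", "bb", "CC"), ("a", "bbb", "C"),
    ("a", "bbbb", "D"), ("a", "bbbbb", "E"),
    ("aa", "bb", "CCC"), ("aa", "bb", "CC"),
    ("aaa", "com", "H"), ("aaa", "com", "H"),  # exact duplicate, same domain
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (local TEXT, domain TEXT, password TEXT)")
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?)", rows)

query = """
WITH dedup AS (
    -- Step 1: one row per (local, domain, password) triple.
    SELECT DISTINCT local, domain, password FROM tablename
),
pairs AS (
    -- Step 2: pairs that still span more than one row, i.e. several domains.
    SELECT local, password
    FROM dedup
    GROUP BY local, password
    HAVING COUNT(*) > 1
)
SELECT d.local, d.domain, d.password
FROM dedup AS d
JOIN pairs AS p ON p.local = d.local AND p.password = d.password
ORDER BY d.local, d.password, d.domain
"""
result = conn.execute(query).fetchall()
print(result)
```

The duplicated (aaa, com, H) rows collapse to a single row in step 1, so that pair is not reported.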
Answer 1: (score: 0)
I want to find all rows that have the same local and password pair but a different domain.
I think you can do this:
select t.* except (min_domain, max_domain)
from (select t.*,
min(domain) over (partition by local, password) as min_domain,
max(domain) over (partition by local, password) as max_domain
from tablename t
) t
where min_domain <> max_domain;
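The window-function approach can be tried locally with the sketch below, again using SQLite through Python's sqlite3 module as an assumed stand-in for BigQuery. SQLite does not support SELECT * EXCEPT, so the output columns are listed explicitly, and window functions need SQLite 3.25 or newer. The rows are the test data from the other answer.

```python
import sqlite3

rows = [
    ("a", "b", "C"), ("a", "bb", "C"), ("a", "bb", "CC"), ("a", "bbb", "C"),
    ("a", "bbbb", "D"), ("a", "bbbbb", "E"),
    ("aa", "bb", "CCC"), ("aa", "bb", "CC"),
    ("aaa", "com", "H"), ("aaa", "com", "H"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (local TEXT, domain TEXT, password TEXT)")
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?)", rows)

query = """
SELECT local, domain, password
FROM (
    SELECT t.*,
           -- Smallest and largest domain within each (local, password) group.
           MIN(domain) OVER (PARTITION BY local, password) AS min_domain,
           MAX(domain) OVER (PARTITION BY local, password) AS max_domain
    FROM tablename AS t
) AS w
-- The two differ only when the group contains at least two distinct domains.
WHERE min_domain <> max_domain
ORDER BY local, password, domain
"""
window_result = conn.execute(query).fetchall()
print(window_result)
```

Unlike the deduplicating query in the other answer, this form returns every row of a qualifying group, so exact duplicate rows inside such a group would appear more than once.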