Question

I have a database table with these columns:

local  domain  email_sha256  password  password_sha256
a      b       ...           C         ...
a      bb      ...           C         ...
a      bb      ...           CC        ...
a      bbb     ...           C         ...
aa     bb      ...           CCC       ...
aa     bb      ...           CC        ...

The local and domain parts are really an email address, split at the @ character.

test@gmail.com

local = test

domain = gmail.com

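I want to find all rows that have the same local and password pair but a different domain. If I use only the local, domain, and password columns, this should return something like: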

local  domain  password
a      b       C
a      bb      C
a      bbb     C

I have been trying to first identify all duplicated local, password pairs with:

SELECT local, password 
FROM tablename
GROUP BY local, password
HAVING count(*) > 1

Now, to get more columns than just the ones in the GROUP BY, I JOIN the table to itself:

SELECT local, domain, password 
FROM tablename
JOIN (SELECT local, password FROM tablename GROUP BY local, password HAVING count(*) > 1)
USING (local, password)

Now, to make sure the domains differ, I join the table to itself once more and add a WHERE clause. To avoid duplicates, I use GROUP BY. This is my final query.

SELECT A.local, A.domain, A.password
FROM tablename as A
JOIN 
    (SELECT  local, domain, password 
    FROM tablename
    JOIN 
        (SELECT local, password 
        FROM tablename 
        GROUP BY local, password 
        HAVING count(*) > 1) 
    USING (local, password)) as B
USING (local, password)
WHERE A.password = B.password AND A.domain != B.domain AND A.local = B.local
GROUP BY local, domain, password
ORDER BY local, password

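Am I removing potentially valid results with this query? Also, is there a faster/better query I could run that achieves the same result?

Thanks.

Note: the table has no unique ID, but I am unlikely to have duplicate email_sha256, password_sha256 pairs, so those could be used as an ID.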

Answer 1

Below is for BigQuery Standard SQL

#standardSQL
WITH remove_dup_domains AS (
  SELECT rec.* FROM (
    SELECT local, domain, password, ANY_VALUE(t) rec
    FROM `project.dataset.table` t
    GROUP BY local, domain, password
  )
)
SELECT y.* FROM (
  SELECT ARRAY_AGG(t) bin 
  FROM remove_dup_domains t
  GROUP BY local, password
  HAVING COUNT(1) > 1
)x, x.bin y

You can test the above with the sample data from your question, as in the example below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' local, 'b' domain, 'C' password, 'whatever else1' other_cols UNION ALL
  SELECT 'a', 'bb', 'C', 'whatever else2' UNION ALL
  SELECT 'a', 'bb', 'CC', 'whatever else3' UNION ALL
  SELECT 'a', 'bbb', 'C', 'whatever else4' UNION ALL
  SELECT 'a', 'bbbb', 'D', 'whatever else5' UNION ALL
  SELECT 'a', 'bbbbb', 'E', 'whatever else6' UNION ALL
  SELECT 'aa', 'bb', 'CCC', 'whatever else7' UNION ALL
  SELECT 'aa', 'bb', 'CC', 'whatever else8' UNION ALL
  SELECT 'aaa', 'com', 'H', 'whatever else9' UNION ALL
  SELECT 'aaa', 'com', 'H', 'whatever else10' 
), remove_dup_domains AS (
  SELECT rec.* FROM (
    SELECT local, domain, password, ANY_VALUE(t) rec
    FROM `project.dataset.table` t
    GROUP BY local, domain, password
  )
)
SELECT y.* FROM (
  SELECT ARRAY_AGG(t) bin 
  FROM remove_dup_domains t
  GROUP BY local, password
  HAVING COUNT(1) > 1
)x, x.bin y

with the result

Row local   domain  password    other_cols   
1   a       b       C           whatever else1   
2   a       bb      C           whatever else2   
3   a       bbb     C           whatever else4
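
For comparison, here is a sketch that stays closer to the join-based approach in the question. It is not part of the original answer. The idea is to count distinct domains per (local, password) pair instead of counting rows, so a pair qualifies only when it really appears with more than one domain, and the second self-join on A.domain != B.domain is not needed. The table name `project.dataset.table` is the same placeholder used above, and the subquery alias `p` is just an illustrative name.

```sql
#standardSQL
-- Sketch under the assumptions stated above; not the answer author's query.
-- Step 1 (subquery p): (local, password) pairs seen with 2+ distinct domains.
-- Step 2: join back to the table to fetch the matching rows.
SELECT DISTINCT t.local, t.domain, t.password
FROM `project.dataset.table` t
JOIN (
  SELECT local, password
  FROM `project.dataset.table`
  GROUP BY local, password
  HAVING COUNT(DISTINCT domain) > 1   -- distinct domains, not row count
) p
USING (local, password)
ORDER BY local, password, domain
```

SELECT DISTINCT plays the role of the final GROUP BY in the question: it collapses exact duplicates of (local, domain, password). To return the other columns as well, select `t.*` and drop the DISTINCT.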

Answer 2

I want to find all rows that have the same local and password pair but a different domain.

I think you can do:

select t.* except (min_domain, max_domain)
from (select t.*,
             min(domain) over (partition by local, password) as min_domain,
             max(domain) over (partition by local, password) as max_domain
      from tablename t
     ) t
where min_domain <> max_domain;
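
The query above works because, within one (local, password) partition, the minimum and maximum domain can only differ when the partition contains at least two distinct domains. As a rough way to try it out, the sketch below runs the same query against the six sample rows from the question. The inline `tablename` CTE is only a stand-in for the real table, and the ORDER BY is added here for readable output.

```sql
#standardSQL
-- Sketch: the window-function query applied to the question's sample rows.
-- The CTE below is a stand-in for the real table (an assumption for testing).
WITH tablename AS (
  SELECT 'a' local, 'b' domain, 'C' password UNION ALL
  SELECT 'a', 'bb', 'C' UNION ALL
  SELECT 'a', 'bb', 'CC' UNION ALL
  SELECT 'a', 'bbb', 'C' UNION ALL
  SELECT 'aa', 'bb', 'CCC' UNION ALL
  SELECT 'aa', 'bb', 'CC'
)
select t.* except (min_domain, max_domain)
from (select t.*,
             -- smallest and largest domain within each (local, password) group
             min(domain) over (partition by local, password) as min_domain,
             max(domain) over (partition by local, password) as max_domain
      from tablename t
     ) t
where min_domain <> max_domain   -- true only if the group has 2+ distinct domains
order by local, password, domain;
```

On this sample only the (a, C) group has more than one domain, so the three rows with domains b, bb, and bbb come back. Note that, unlike the GROUP BY in the question, this form does not collapse rows that are exact duplicates of each other.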

Select rows where two columns match but another differs, when there are multiple tuples
