Finding duplicates in MySQL

Time: 2012-08-31 20:43:59

Tags: mysql duplicates

SELECT COUNT(organization.ID)
FROM organization
WHERE organization.NAME IN (
    SELECT organization.NAME
    FROM organization
    WHERE organization.NAME <> ''
        AND organization.APPROVED = 0 
        AND organization.CREATED_AT > '2012-07-31 04:31:08'
    GROUP BY organization.NAME
    HAVING COUNT(organization.ID) > 1
)

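This query finds the duplicates. The problem is that, because of the inner statement, the page takes 6 seconds to load. Is there a way to make it run faster? The MySQL version is 5.1.

4 answers:

Answer 0: (score: 1)

Yes. It is slow because MySQL handles IN subqueries poorly. You can use this instead: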

SELECT COUNT(o.ID)
FROM organization o
WHERE EXISTS (
    SELECT o2.NAME
    FROM organization o2
    WHERE o2.NAME <> ''
        AND o2.APPROVED = 0
        AND o2.CREATED_AT > '2012-07-31 04:31:08'
        AND o2.NAME = o.NAME  -- correlate the subquery with the outer row
    GROUP BY o2.NAME
    HAVING COUNT(o2.ID) > 1
)
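
Before rewriting, it can help to confirm where the time goes. A minimal sketch, using only the table and columns from the question: prefix the original statement with EXPLAIN and look at the select_type column. On MySQL 5.1 an IN subquery is commonly reported as DEPENDENT SUBQUERY, which means it is re-evaluated for each outer row, but the actual plan depends on your data and indexes.

```sql
-- Show the execution plan of the original query from the question.
EXPLAIN
SELECT COUNT(organization.ID)
FROM organization
WHERE organization.NAME IN (
    SELECT organization.NAME
    FROM organization
    WHERE organization.NAME <> ''
        AND organization.APPROVED = 0
        AND organization.CREATED_AT > '2012-07-31 04:31:08'
    GROUP BY organization.NAME
    HAVING COUNT(organization.ID) > 1
);
```

Running EXPLAIN on both the IN form and the EXISTS form is a quick way to see whether the rewrite actually changes the plan.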

Answer 1: (score: 0)

Try to avoid IN; join against a derived table instead:

SELECT COUNT(organization.ID)
FROM 
    organization
    INNER JOIN 
    (
        SELECT organization.NAME
        FROM organization
        WHERE organization.NAME <> ''
            AND organization.APPROVED = 0 
            AND organization.CREATED_AT > '2012-07-31 04:31:08'
        GROUP BY organization.NAME
        HAVING COUNT(organization.ID) > 1
    ) AS t ON organization.NAME = t.Name

Answer 2: (score: 0)

I have also found that indexing the database columns involved can greatly speed up complex queries.
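
As a sketch of what that could look like for the query in the question: the columns below are the ones its WHERE and GROUP BY clauses use, while the index name organization_name_ix is hypothetical and the best column order depends on the data.

```sql
-- Hypothetical index name; columns taken from the question's inner query.
CREATE INDEX organization_name_ix
    ON organization (NAME, APPROVED, CREATED_AT);

-- Re-check the plan afterwards to see whether the index is actually chosen.
EXPLAIN
SELECT organization.NAME
FROM organization
WHERE organization.NAME <> ''
    AND organization.APPROVED = 0
    AND organization.CREATED_AT > '2012-07-31 04:31:08'
GROUP BY organization.NAME
HAVING COUNT(organization.ID) > 1;
```

If the key column of the EXPLAIN output does not show the new index, the optimizer chose not to use it, and a different column order may be worth trying.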

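Answer 3: (score: 0)

If what you want returned is a total "count" of all duplicates, but only for organization names that have two or more rows satisfying the specified predicates on APPROVED and CREATED_AT, then you can use an alternative statement that returns an equivalent result: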

SELECT SUM(c.cnt) 
  FROM ( SELECT COUNT(o.ID) AS cnt
           FROM organization o
          WHERE o.NAME <> ''
          GROUP
             BY o.NAME
         HAVING SUM(o.APPROVED = 0 AND o.CREATED_AT > '2012-07-31 04:31:08') > 1
       ) c

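MySQL can satisfy this query from a suitable covering index; without one, it will likely be a full scan of the organization table. Either way, it avoids referencing the organization table twice, and it avoids a JOIN operation.

A suitable covering index for this query is: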

CREATE INDEX organization_ix ON organization (NAME, CREATED_AT, APPROVED, ID)

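Note that if the ID column is guaranteed to be non-NULL (because of a NOT NULL constraint, or because it is the table's PRIMARY KEY), you can avoid referencing that column, and it can be left out of the index definition: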

SELECT SUM(c.cnt) 
  FROM ( SELECT SUM(1) AS cnt
           FROM organization o
          WHERE o.NAME <> ''
          GROUP
             BY o.NAME
         HAVING SUM(o.APPROVED = 0 AND o.CREATED_AT > '2012-07-31 04:31:08') > 1
       ) c

The EXPLAIN output shows this query being satisfied from the index, without referencing any data blocks in the table:

id  select_type  table       type    possible_keys    key              key_len  ref       rows  Extra                     
--  -----------  ----------  ------  ---------------  ---------------  -------  ------  ------  --------------------------
 1  PRIMARY      <derived2>  ALL     (NULL)           (NULL)           (NULL)   (NULL)       2                            
 2  DERIVED      o           index   organization_ix  organization_ix  44       (NULL)      29  Using where; Using index
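
To check the equivalence claim on a small data set, the following sketch may help. The table definition and the sample rows are assumptions made for illustration; only the column names come from the question.

```sql
-- Hypothetical table definition; only the column names come from the question.
CREATE TABLE organization (
    ID         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    NAME       VARCHAR(100) NOT NULL,
    APPROVED   TINYINT NOT NULL DEFAULT 0,
    CREATED_AT DATETIME NOT NULL
);

-- Made-up sample rows.
INSERT INTO organization (NAME, APPROVED, CREATED_AT) VALUES
    ('Acme',    0, '2012-08-01 10:00:00'),  -- matches the predicates
    ('Acme',    0, '2012-08-02 10:00:00'),  -- matches the predicates
    ('Acme',    1, '2012-06-01 10:00:00'),  -- same name, but approved and older
    ('Globex',  0, '2012-08-03 10:00:00'),  -- only one matching row
    ('Initech', 0, '2012-08-04 10:00:00'),  -- matches the predicates
    ('Initech', 0, '2012-07-01 10:00:00'),  -- before the cutoff
    ('',        0, '2012-08-05 10:00:00'),  -- empty name, excluded
    ('',        0, '2012-08-06 10:00:00');  -- empty name, excluded

-- Original form from the question.
SELECT COUNT(organization.ID)
FROM organization
WHERE organization.NAME IN (
    SELECT organization.NAME
    FROM organization
    WHERE organization.NAME <> ''
        AND organization.APPROVED = 0
        AND organization.CREATED_AT > '2012-07-31 04:31:08'
    GROUP BY organization.NAME
    HAVING COUNT(organization.ID) > 1
);

-- Single-pass form from this answer.
SELECT SUM(c.cnt)
  FROM ( SELECT SUM(1) AS cnt
           FROM organization o
          WHERE o.NAME <> ''
          GROUP BY o.NAME
         HAVING SUM(o.APPROVED = 0 AND o.CREATED_AT > '2012-07-31 04:31:08') > 1
       ) c;
```

With these rows, only 'Acme' has two or more rows matching the predicates, so both statements should return 3: every 'Acme' row is counted, including the approved one.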