Finding rows that share 3 out of 5 fields - how can I speed up the query?

Asked: 2011-09-02 13:19:55

Tags: mysql performance

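The query below works fine but is very slow. On a table of about 7,500 rows it takes roughly 30 seconds to run. How can I speed it up?

The goal is to find "near duplicate" rows within the same table. Two rows count as a match when at least 3 of the 5 fields are equal.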

SELECT 
originalTable.id,
originalTable.lastname,
originalTable.firstname,
originalTable.address,
originalTable.city,
originalTable.email

FROM
address as originalTable,
address as compareTable

WHERE

# do not find the same record
originalTable.id != compareTable.id and

# at least 3 out of those 5 should match
(originalTable.firstname = compareTable.firstname) +
(originalTable.lastname = compareTable.lastname)  +
(originalTable.address = compareTable.address and originalTable.address != '')  +
(originalTable.city = compareTable.city and originalTable.city != '')  +
(originalTable.email = compareTable.email and originalTable.email != '')
>= 3


GROUP BY
originalTable.id

ORDER BY
originalTable.lastname asc,
originalTable.firstname asc,
originalTable.city asc

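Thanks for any optimization tips.

2 answers:

Answer 0 (score: 0)

It is true that a Cartesian product is needed here. I came up with the following solution: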

CREATE TABLE address_dups(INDEX (is_duplicate)) ENGINE=MEMORY   
SELECT 
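  -- alias both IDs: CREATE TABLE ... SELECT rejects two columns named `id`
  originalTable.id AS original_id,
  compareTable.id AS compare_id,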
(
  (originalTable.firstname = compareTable.firstname) +
  (originalTable.lastname = compareTable.lastname)  +
  (originalTable.address = compareTable.address and originalTable.address != '')  +
  (originalTable.city = compareTable.city and originalTable.city != '')  +
  (originalTable.email = compareTable.email and originalTable.email != '')
  >= 3
) AS is_duplicate
FROM 
address as originalTable,
address as compareTable
WHERE originalTable.id != compareTable.id;

SELECT * FROM address_dups WHERE is_duplicate = 1;

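This gives you, for every row ID, the IDs of the rows that are its fuzzy duplicates, which is what you asked for.

Answer 1 (score: 0)

As you have already noticed, the comparison needs a Cartesian join... but only a PARTIAL one. Since you need a value in both the first name and last name fields, I would have an index on at least (lastname, firstname). Then add a WHERE clause to your criteria that matches on only part of the last name, say the first 2-3 characters. That way, each row is only joined against the rows that share the same last-name prefix. There is no point in comparing "Bill Jones" with "Tonya Smith". You may, however, be interested in "Bill Jones" vs "William Jones" when they share an address, city and/or email. Consider the following names as a sample for the Cartesian comparison.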

(names fictitious for sample)
ID  Last     First
1   Adams    Brian
2   Adams    Marsha
3   Andrews  Jeff
4   Brown    Steve
5   Johns    Dave
6   Johnson  Bill
7   Johnson  William

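If you qualify only the left 3 characters of each last name in the WHERE clause, the two "Adams" rows are compared with each other. "Andrews" and "Brown" are not compared with anyone. Then the 3 rows starting with "Joh" are tested against each other as a Cartesian product...

Now add one MORE condition to the WHERE clause... Since you have an ID column, make sure it is part of the WHERE clause too: the ID from the second table must always be greater than the ID of the row you are on. For example, take the "Adams" names. Once ID 1 has been compared with ID 2, you already know whether they are duplicates (in this case they are not), so why go backwards and compare ID 2 with ID 1 again?

So this sample of 7 records results in the following comparisons: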
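To put rough numbers on that saving, here is a minimal sketch in Python (used only for illustration; the function names are hypothetical) that counts the row pairs examined for the roughly 7,500-row table from the question, with and without the ID ordering condition.

```python
def pairs_with_not_equal(n):
    """Pairs examined with `a.id != b.id`: each pair is visited in both directions."""
    return n * (n - 1)


def pairs_with_less_than(n):
    """Pairs examined with `a.id < b.id`: each pair is visited exactly once."""
    return n * (n - 1) // 2


rows = 7500  # approximate table size given in the question
print(pairs_with_not_equal(rows))  # 56242500
print(pairs_with_less_than(rows))  # 28121250
```

The ordering condition alone halves the work; the last-name prefix filter then cuts it further, by an amount that depends on how the names in the table are distributed.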

1-2
2-no more to compare against
3-no more to compare against
4-no more to compare against
5-6
5-7
6-7
7-no more to compare against
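The list above can be reproduced with a short Python sketch. It is only an illustration of the two filters (same 3-character last-name prefix, lower ID first); the function name `candidate_pairs` is hypothetical, and the prefix comparison is lowercased on the assumption that the table uses a case-insensitive collation.

```python
from itertools import combinations

# Sample rows from the table above: (id, last name, first name)
people = [
    (1, "Adams", "Brian"),
    (2, "Adams", "Marsha"),
    (3, "Andrews", "Jeff"),
    (4, "Brown", "Steve"),
    (5, "Johns", "Dave"),
    (6, "Johnson", "Bill"),
    (7, "Johnson", "William"),
]


def candidate_pairs(rows, prefix_len=3):
    """Return (id, id) pairs that share a last-name prefix, lower ID first."""
    ordered = sorted(rows)  # sort by ID so combinations() yields the lower ID first
    return [
        (a[0], b[0])
        for a, b in combinations(ordered, 2)
        if a[1][:prefix_len].lower() == b[1][:prefix_len].lower()
    ]


print(candidate_pairs(people))  # [(1, 2), (5, 6), (5, 7), (6, 7)]
```

Only 4 of the 21 possible pairs are left for the more expensive 3-out-of-5 test.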

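So in the end you would have something like the following (it includes the ID of the close match as a basis for review... you could even pull in all of its columns as "MatchFirstName, MatchLastName, MatchCity, etc." for a preview...)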

SELECT 
      originalTable.id,
      originalTable.lastname,
      originalTable.firstname,
      originalTable.address,
      originalTable.city,
      originalTable.email,
      compareTable.ID as MatchID
    FROM
      address as originalTable,
      address as compareTable
    WHERE
          originalTable.ID < compareTable.ID
      AND left( originalTable.LastName, 3 ) = left( compareTable.LastName, 3 )
      AND (originalTable.firstname = compareTable.firstname) 
        + (originalTable.lastname = compareTable.lastname) 
        + (originalTable.address = compareTable.address and originalTable.address != '')  
        + (originalTable.city = compareTable.city and originalTable.city != '')  
        + (originalTable.email = compareTable.email and originalTable.email != '') >= 3
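For readers who want to check the logic outside the database, the same idea can be sketched in Python: bucket the rows by last-name prefix, then score each pair inside a bucket. This is an illustration under stated assumptions, not a replacement for the query. The helper names and the sample rows are hypothetical, and Python string comparison is case-sensitive, which may differ from the table's collation. Like the query, it cannot see a pair whose last names differ within the first three characters, even if address, city and email all match.

```python
from collections import defaultdict
from itertools import combinations

FIELDS = ("firstname", "lastname", "address", "city", "email")
# Fields where an empty string must not count as a match (as in the query).
NON_EMPTY = {"address", "city", "email"}


def match_score(a, b):
    """Count how many of the five fields are equal between two rows."""
    return sum(
        1
        for f in FIELDS
        if a[f] == b[f] and (f not in NON_EMPTY or a[f] != "")
    )


def near_duplicates(rows, prefix_len=3, threshold=3):
    """Return (id, match_id) pairs that score at least `threshold`."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["lastname"][:prefix_len]].append(row)
    pairs = []
    for bucket in buckets.values():
        bucket.sort(key=lambda r: r["id"])  # lower ID first, like ID < MatchID
        for a, b in combinations(bucket, 2):
            if match_score(a, b) >= threshold:
                pairs.append((a["id"], b["id"]))
    return sorted(pairs)


# Hypothetical sample data, invented for this sketch.
sample = [
    {"id": 1, "firstname": "Bill", "lastname": "Johnson",
     "address": "1 Main St", "city": "Springfield", "email": "bj@example.com"},
    {"id": 2, "firstname": "William", "lastname": "Johnson",
     "address": "1 Main St", "city": "Springfield", "email": ""},
    {"id": 3, "firstname": "Dave", "lastname": "Johns",
     "address": "", "city": "Springfield", "email": ""},
    {"id": 4, "firstname": "Steve", "lastname": "Brown",
     "address": "", "city": "", "email": ""},
]

print(near_duplicates(sample))  # [(1, 2)]
```

Rows 1 and 2 match on last name, address and city, so they reach the threshold of 3. Rows 2 and 3 share a city and both have an empty email, but the empty email does not count, so they score only 1.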