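The query below works well but is slow. On a table of about 7,500 rows, it takes about 30 seconds to run. How can I speed it up?
The goal is to find "near-duplicate" rows within the same table. When 3 out of 5 fields match, we have a match.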
SELECT
    originalTable.id,
    originalTable.lastname,
    originalTable.firstname,
    originalTable.address,
    originalTable.city,
    originalTable.email
FROM
    address AS originalTable,
    address AS compareTable
WHERE
    # do not find the same record
    originalTable.id != compareTable.id AND
    # at least 3 out of those 5 should match
    (originalTable.firstname = compareTable.firstname) +
    (originalTable.lastname = compareTable.lastname) +
    (originalTable.address = compareTable.address AND originalTable.address != '') +
    (originalTable.city = compareTable.city AND originalTable.city != '') +
    (originalTable.email = compareTable.email AND originalTable.email != '')
    >= 3
GROUP BY
    originalTable.id
ORDER BY
    originalTable.lastname ASC,
    originalTable.firstname ASC,
    originalTable.city ASC
Thanks for any optimization tips.
Answer 0 (score: 0)
It is true that a Cartesian product is needed here. I came up with the following solution:
CREATE TABLE address_dups (INDEX (is_duplicate)) ENGINE=MEMORY
SELECT
    # the two id columns need distinct aliases:
    # CREATE TABLE ... SELECT rejects duplicate column names
    originalTable.id AS original_id,
    compareTable.id AS duplicate_id,
    (
        (originalTable.firstname = compareTable.firstname) +
        (originalTable.lastname = compareTable.lastname) +
        (originalTable.address = compareTable.address AND originalTable.address != '') +
        (originalTable.city = compareTable.city AND originalTable.city != '') +
        (originalTable.email = compareTable.email AND originalTable.email != '')
        >= 3
    ) AS is_duplicate
FROM
    address AS originalTable,
    address AS compareTable
WHERE originalTable.id != compareTable.id;
SELECT * FROM address_dups WHERE is_duplicate = 1;
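This gives you, for every row ID, the IDs of the fuzzy-duplicate rows you asked for.
Answer 1 (score: 0)
The comparison you describe does require a Cartesian join... but only a PARTIAL one. Since you require a value in both the first-name and last-name fields, I would have an index on at least (lastname, firstname). Then add a WHERE clause that qualifies only on part of the last name... say the first 2-3 characters. That way, the Cartesian comparison only runs against rows whose last name starts with the same prefix. There is no point in comparing "Bill Jones" with "Tonya Smith". However, you might be interested in "Bill Jones" vs. "William Jones" when they share an address, city, and/or email. Consider the following names for the Cartesian comparison.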
(names fictitious for sample)
ID  Last     First
1   Adams    Brian
2   Adams    Marsha
3   Andrews  Jeff
4   Brown    Steve
5   Johns    Dave
6   Johnson  Bill
7   Johnson  William
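If the WHERE clause qualifies only on the left 3 characters of each last name, the two "Adams" rows are compared with each other. "Andrews" and "Brown" are not compared with anyone. Then the three rows starting with "Joh" are tested against one another as a Cartesian product...
Now, add one MORE condition to the WHERE clause... Since you have an ID column, make it part of the WHERE clause too: the ID from the second table must always be greater than the ID of the row you are on. For example, when comparing the "Adams" rows, you already know whether ID 1 compared with ID 2 is a duplicate (in this case it is not), so why go backwards and re-compare ID 2 with ID 1?
So, this sample of 7 records results in the following comparisons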
1-2
2-no more to compare against
3-no more to compare against
4-no more to compare against
5-6
5-7
6-7
7-no more to compare against
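To check the pairing listed above, here is a minimal sketch that loads just the seven sample names into a scratch table and lists the candidate pairs. The table name sample_names and the aliases o and c are made up for this illustration; a regular table is used because MySQL does not allow a TEMPORARY table to be referenced twice in the same query.

```sql
-- Hypothetical scratch table holding only the sample names from above.
CREATE TABLE sample_names (
    id        INT PRIMARY KEY,
    lastname  VARCHAR(50) NOT NULL,
    firstname VARCHAR(50) NOT NULL
);

INSERT INTO sample_names (id, lastname, firstname) VALUES
    (1, 'Adams',   'Brian'),
    (2, 'Adams',   'Marsha'),
    (3, 'Andrews', 'Jeff'),
    (4, 'Brown',   'Steve'),
    (5, 'Johns',   'Dave'),
    (6, 'Johnson', 'Bill'),
    (7, 'Johnson', 'William');

-- Each candidate pair appears once: same 3-character last-name prefix,
-- and the lower ID is always on the left.
SELECT o.id AS original_id, c.id AS compare_id
FROM sample_names AS o
JOIN sample_names AS c
  ON o.id < c.id
 AND LEFT(o.lastname, 3) = LEFT(c.lastname, 3)
ORDER BY o.id, c.id;

-- Clean up the scratch table.
DROP TABLE sample_names;
```

This returns the four pairs 1-2, 5-6, 5-7 and 6-7, instead of the 42 ordered pairs the unrestricted self-join would test. Keep in mind that the prefix filter also skips pairs whose last names differ in the first three characters, even when three of the other fields match.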
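So in the end you would have something like the following (it includes the ID of the close match as a basis for review... you could even pull all of its columns as "MatchFirstName, MatchLastName, MatchCity, etc." for a preview...)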
SELECT
    originalTable.id,
    originalTable.lastname,
    originalTable.firstname,
    originalTable.address,
    originalTable.city,
    originalTable.email,
    compareTable.id AS MatchID
FROM
    address AS originalTable,
    address AS compareTable
WHERE
    originalTable.id < compareTable.id
    AND LEFT(originalTable.lastname, 3) = LEFT(compareTable.lastname, 3)
    AND (
        (originalTable.firstname = compareTable.firstname)
        + (originalTable.lastname = compareTable.lastname)
        + (originalTable.address = compareTable.address AND originalTable.address != '')
        + (originalTable.city = compareTable.city AND originalTable.city != '')
        + (originalTable.email = compareTable.email AND originalTable.email != '')
    ) >= 3
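One caveat: because LEFT() is applied to the column on both sides of the comparison, an ordinary index on lastname generally cannot be used for that condition. A possible workaround, assuming MySQL 5.7 or later, is to store the prefix in a generated column and index that. The names lastname_prefix and idx_lastname_prefix and the aliases o and c are invented for this sketch, and whether it actually helps should be checked with EXPLAIN on the real data.

```sql
-- Assumption: MySQL 5.7+ (generated columns). Column and index names are hypothetical.
ALTER TABLE address
    ADD COLUMN lastname_prefix VARCHAR(3)
        GENERATED ALWAYS AS (LEFT(lastname, 3)) STORED,
    ADD INDEX idx_lastname_prefix (lastname_prefix);

-- Same logic as the query above, but joining on the indexed prefix column.
SELECT
    o.id,
    o.lastname,
    o.firstname,
    o.address,
    o.city,
    o.email,
    c.id AS MatchID
FROM address AS o
JOIN address AS c
  ON c.lastname_prefix = o.lastname_prefix
 AND c.id > o.id
WHERE (
        (o.firstname = c.firstname)
        + (o.lastname = c.lastname)
        + (o.address = c.address AND o.address != '')
        + (o.city = c.city AND o.city != '')
        + (o.email = c.email AND o.email != '')
      ) >= 3;
```

The join condition now compares two plain columns, so the optimizer can look up matching prefixes through the index rather than evaluating LEFT() for every pair of rows.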