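I have a fairly crazy query that finds all records with a duplicate value, except the FIRST record for each value. It takes quite a long time to run against 38,000 records: about 50 seconds.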
UPDATE exr_exrresv
SET mh_duplicate = 1
WHERE exr_exrresv._id IN
(
SELECT F._id
FROM exr_exrresv AS F
WHERE Exists
(
SELECT PHONE_NUMBER,
Count(_id)
FROM exr_exrresv
WHERE exr_exrresv.PHONE_NUMBER = F.PHONE_NUMBER
AND exr_exrresv.PHONE_NUMBER != ''
AND mh_active = 1 AND mh_duplicate = 0
GROUP BY exr_exrresv.PHONE_NUMBER
HAVING Count(exr_exrresv._id) > 1)
)
AND exr_exrresv._id NOT IN
(
SELECT Min(_id)
FROM exr_exrresv AS F
WHERE Exists
(
SELECT PHONE_NUMBER,
Count(_id)
FROM exr_exrresv
WHERE exr_exrresv.PHONE_NUMBER = F.PHONE_NUMBER
AND exr_exrresv.PHONE_NUMBER != ''
AND mh_active = 1
AND mh_duplicate = 0
GROUP BY exr_exrresv.PHONE_NUMBER
HAVING Count(exr_exrresv._id) > 1
)
GROUP BY PHONE_NUMBER
);
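Any tips on how to optimize it, or where I should start? I have checked the query plan, but I am really not sure how to begin improving it. Temp tables? A better query?
Here is the EXPLAIN QUERY PLAN output: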
0|0|0|SEARCH TABLE exr_exrresv USING INTEGER PRIMARY KEY (rowid=?) (~12 rows)
0|0|0|EXECUTE LIST SUBQUERY 0
0|0|0|SCAN TABLE exr_exrresv AS F (~500000 rows)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE exr_exrresv USING AUTOMATIC COVERING INDEX (PHONE_NUMBER=? AND mh_active=? AND mh_duplicate=?) (~7 rows)
1|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|EXECUTE LIST SUBQUERY 2
2|0|0|SCAN TABLE exr_exrresv AS F (~500000 rows)
2|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 3
3|0|0|SEARCH TABLE exr_exrresv USING AUTOMATIC COVERING INDEX (PHONE_NUMBER=? AND mh_active=? AND mh_duplicate=?) (~7 rows)
3|0|0|USE TEMP B-TREE FOR GROUP BY
2|0|0|USE TEMP B-TREE FOR GROUP BY
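Any tips would be greatly appreciated. :)
Also, I am running the SQL queries from Ruby, so if it makes more sense to move the logic out of SQL and write it in Ruby, that is possible.
The schema is below, and you can play with it on sqlfiddle here: http://sqlfiddle.com/#!2/2c07e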
_id INTEGER PRIMARY KEY
OPPORTUNITY_ID varchar(50)
CREATEDDATE varchar(50)
FIRSTNAME varchar(50)
LASTNAME varchar(50)
MAILINGSTREET varchar(50)
MAILINGCITY varchar(50)
MAILINGSTATE varchar(50)
MAILINGZIPPOSTALCODE varchar(50)
EMAIL varchar(50)
CONTACT_PHONE varchar(50)
PHONE_NUMBER varchar(50)
CallFromWeb varchar(50)
OPPORTUNITY_ORIGIN varchar(50)
PROJECTED_LTV varchar(50)
MOVE_IN_DATE varchar(50)
mh_processed_date varchar(50)
mh_control INTEGER
mh_active INTEGER
mh_duplicate INTEGER
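Answer 0 (score: 1)
Guessing from your post, it looks like you are trying to set the mh_duplicate column on every record that shares a phone number with another record, unless it is the first record with that phone number?
If that is correct, I think this should get you the ids to update (you may need to add the appropriate WHERE criteria); from there, the update is straightforward: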
SELECT e._Id
FROM exr_exrresv e
JOIN
( SELECT t.Phone_Number
FROM exr_exrresv t
GROUP BY t.Phone_Number
HAVING COUNT(t.Phone_Number) > 1
) e2 ON e.Phone_Number = e2.Phone_Number
LEFT JOIN
( SELECT MIN(t2._Id) as KeepId
FROM exr_exrresv t2
GROUP BY t2.Phone_Number
) e3 ON e._Id = e3.KeepId
WHERE e3.KeepId is null
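To make the "update from there" step concrete, here is a minimal sketch that wraps the SELECT above in the UPDATE and runs it against an in-memory SQLite database. It uses Python's standard sqlite3 module purely so the example is runnable on its own; the SQL is the part that carries over to Ruby. The cut-down table (four of the original columns) and the sample rows are assumptions made for illustration, and the extra filters on empty phone numbers and mh_active from the question are deliberately left out.

```python
import sqlite3

# In-memory database with only the columns this example needs
# (a cut-down version of the schema in the question).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exr_exrresv ("
    " _id INTEGER PRIMARY KEY,"
    " PHONE_NUMBER varchar(50),"
    " mh_active INTEGER,"
    " mh_duplicate INTEGER)"
)

# Hypothetical sample data: ids 1, 2 and 4 share a phone number.
rows = [
    (1, "555-0100", 1, 0),
    (2, "555-0100", 1, 0),
    (3, "555-0199", 1, 0),
    (4, "555-0100", 1, 0),
]
conn.executemany("INSERT INTO exr_exrresv VALUES (?, ?, ?, ?)", rows)

# The SELECT from the answer, used as the id list for the UPDATE.
conn.execute("""
UPDATE exr_exrresv
SET mh_duplicate = 1
WHERE _id IN (
    SELECT e._id
    FROM exr_exrresv e
    JOIN (SELECT t.PHONE_NUMBER
          FROM exr_exrresv t
          GROUP BY t.PHONE_NUMBER
          HAVING COUNT(t.PHONE_NUMBER) > 1) e2
      ON e.PHONE_NUMBER = e2.PHONE_NUMBER
    LEFT JOIN (SELECT MIN(t2._id) AS KeepId
               FROM exr_exrresv t2
               GROUP BY t2.PHONE_NUMBER) e3
      ON e._id = e3.KeepId
    WHERE e3.KeepId IS NULL
)
""")
conn.commit()

# Every row in a duplicated group except the lowest _id is now flagged.
flagged = [r[0] for r in conn.execute(
    "SELECT _id FROM exr_exrresv WHERE mh_duplicate = 1 ORDER BY _id")]
print(flagged)
```

With this sample data, ids 2 and 4 are flagged and id 1 is kept as the first record for its phone number.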
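Good luck.
Answer 1 (score: 1)
A record is considered a duplicate if there is an active record with a matching phone_number and a smaller _id. (No grouping or counting is needed.)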
update exr_exrresv
set mh_duplicate = 1
where exr_exrresv._id in (
select target._id
from exr_exrresv as target
where target.phone_number != ''
and target.mh_active = 1
and exists (
select null from exr_exrresv as probe
where probe.phone_number = target.phone_number
and probe.mh_active = 1
and probe._id < target._id
)
)
This helps most if there is an index on phone_number, ideally on exr_exrresv (phone_number, _id).
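As a rough sketch of how the index and the EXISTS-based update fit together, the following runs both against an in-memory SQLite database using Python's standard sqlite3 module. The index name idx_exrresv_phone_id, the cut-down table, and the sample rows are all assumptions for illustration; the query plan is printed without an expected value because its wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exr_exrresv ("
    " _id INTEGER PRIMARY KEY,"
    " PHONE_NUMBER varchar(50),"
    " mh_active INTEGER,"
    " mh_duplicate INTEGER)"
)

# Hypothetical sample data covering the filters in the update:
# empty phone numbers and inactive rows must never be flagged.
rows = [
    (1, "555-0100", 1, 0),
    (2, "555-0100", 1, 0),  # duplicate of id 1
    (3, "", 1, 0),
    (4, "", 1, 0),          # empty phone number, ignored
    (5, "555-0100", 0, 0),  # inactive, ignored
    (6, "555-0199", 0, 0),  # inactive, cannot act as the "first" record
    (7, "555-0199", 1, 0),  # first active record for this number
    (8, "555-0199", 1, 0),  # duplicate of id 7
]
conn.executemany("INSERT INTO exr_exrresv VALUES (?, ?, ?, ?)", rows)

# Index suggested in the answer; the name is made up for this example.
conn.execute(
    "CREATE INDEX idx_exrresv_phone_id ON exr_exrresv (PHONE_NUMBER, _id)"
)

update_sql = """
UPDATE exr_exrresv
SET mh_duplicate = 1
WHERE exr_exrresv._id IN (
    SELECT target._id
    FROM exr_exrresv AS target
    WHERE target.PHONE_NUMBER != ''
      AND target.mh_active = 1
      AND EXISTS (
          SELECT NULL FROM exr_exrresv AS probe
          WHERE probe.PHONE_NUMBER = target.PHONE_NUMBER
            AND probe.mh_active = 1
            AND probe._id < target._id
      )
)
"""

# Print the plan so you can check whether the probe uses the index.
for plan_row in conn.execute("EXPLAIN QUERY PLAN " + update_sql):
    print(plan_row[-1])

conn.execute(update_sql)
conn.commit()

flagged = [r[0] for r in conn.execute(
    "SELECT _id FROM exr_exrresv WHERE mh_duplicate = 1 ORDER BY _id")]
print(flagged)
```

On the real table, comparing the printed plan before and after creating the index is a quick way to confirm that the correlated probe has become an index search rather than a full scan.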