让我们说我正在创建一个地址簿,其中主表包含基本联系信息和电话号码子表 -
Contact
===============
Id [PK]
Name
PhoneNumber
===============
Id [PK]
Contact_Id [FK]
Number
因此,Contact记录可能在PhoneNumber表中有零个或多个相关记录。除主键之外的任何列的唯一性都没有约束。事实上,这必须是真的,因为:
我想将可能包含重复记录的大型数据集导入到我的数据库中,然后使用SQL过滤掉重复项。识别重复记录的规则很简单......它们必须共享相同名称和相同数量的具有相同内容的电话记录。
当然,这对于从Contact表中选择重复项非常有效,但是根据我的规则,它不能帮助我检测实际的重复项:
SELECT * FROM Contact
WHERE EXISTS
(SELECT 'x' FROM Contact t2
WHERE t2.Name = Contact.Name AND
t2.Id > Contact.Id);
似乎我想要的是对我已经拥有的东西的逻辑延伸,但我必须忽视它。有什么帮助吗?
谢谢!
答案 0 :(得分:1)
在我的问题中,我创建了一个大大简化的模式,反映了我正在解决的现实问题。 Przemyslaw的答案确实是一个正确的答案,并且做了我用样本架构和扩展后的真实问题。
但是,在对真实模式和更大(~10k记录)数据集进行一些实验后,我发现性能是一个问题。我并不认为自己是一名索引大师,但我无法找到比模式中已有的更好的索引组合。
所以,我提出了一个替代解决方案,它满足相同的要求,但只在一小部分(<10%)的时间内执行,至少使用SQLite3 - 我的生产引擎。希望它可以帮助别人,我会把它作为我问题的替代答案。
DROP TABLE IF EXISTS Contact;
DROP TABLE IF EXISTS PhoneNumber;
CREATE TABLE Contact (
Id INTEGER PRIMARY KEY,
Name TEXT
);
CREATE TABLE PhoneNumber (
Id INTEGER PRIMARY KEY,
Contact_Id INTEGER REFERENCES Contact (Id) ON UPDATE CASCADE ON DELETE CASCADE,
Number TEXT
);
INSERT INTO Contact (Id, Name) VALUES
(1, 'John Smith'),
(2, 'John Smith'),
(3, 'John Smith'),
(4, 'Jane Smith'),
(5, 'Bob Smith'),
(6, 'Bob Smith');
INSERT INTO PhoneNumber (Id, Contact_Id, Number) VALUES
(1, 1, '555-1212'),
(2, 1, '222-1515'),
(3, 2, '222-1515'),
(4, 2, '555-1212'),
(5, 3, '111-2525'),
(6, 4, '111-2525');
COMMIT;
SELECT *
FROM Contact c1
WHERE EXISTS (
SELECT 1
FROM Contact c2
WHERE c2.Id > c1.Id
AND c2.Name = c1.Name
AND (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c2.Id) = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id)
AND (
SELECT COUNT(*)
FROM PhoneNumber p1
WHERE p1.Contact_Id = c2.Id
AND EXISTS (
SELECT 1
FROM PhoneNumber p2
WHERE p2.Contact_Id = c1.Id
AND p2.Number = p1.Number
)
) = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id)
)
;
结果如预期:
Id Name
====== =============
1 John Smith
5 Bob Smith
其他发动机必然具有不同的性能,这可能是完全可以接受的。对于这种模式,这个解决方案似乎与SQLite非常兼容。
答案 1 :(得分:0)
关键字“有”是您的朋友。通用用途是:
select field1, field2, count(*) records
from whereever
where whatever
group by field1, field2
having records > 1
是否可以在having子句中使用别名取决于数据库引擎。您应该能够将这一基本原则应用于您的情况。
答案 2 :(得分:0)
作者表示要求“两个人是同一个人”:
所以这个问题比它看起来要复杂得多(或者我刚刚推翻它)。
示例数据和(一个丑陋的,我知道,但总的想法在那里)我在下面的测试数据上测试的示例查询似乎正常工作(我正在使用Oracle 11g R2):
CREATE TABLE contact (
id NUMBER PRIMARY KEY,
name VARCHAR2(40))
;
CREATE TABLE phone_number (
id NUMBER PRIMARY KEY,
contact_id REFERENCES contact (id),
phone VARCHAR2(10)
);
INSERT INTO contact (id, name) VALUES (1, 'John');
INSERT INTO contact (id, name) VALUES (2, 'John');
INSERT INTO contact (id, name) VALUES (3, 'Peter');
INSERT INTO contact (id, name) VALUES (4, 'Peter');
INSERT INTO contact (id, name) VALUES (5, 'Mike');
INSERT INTO contact (id, name) VALUES (6, 'Mike');
INSERT INTO contact (id, name) VALUES (7, 'Mike');
INSERT INTO phone_number (id, contact_id, phone) VALUES (1, 1, '123'); -- John having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (2, 1, '456'); -- John having number 456
INSERT INTO phone_number (id, contact_id, phone) VALUES (3, 2, '123'); -- John the second having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (4, 2, '456'); -- John the second having number 456
INSERT INTO phone_number (id, contact_id, phone) VALUES (5, 3, '123'); -- Peter having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (6, 3, '456'); -- Peter having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (7, 3, '789'); -- Peter having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (8, 4, '456'); -- Peter the second having number 456
INSERT INTO phone_number (id, contact_id, phone) VALUES (9, 5, '123'); -- Mike having number 456
INSERT INTO phone_number (id, contact_id, phone) VALUES (10, 5, '456'); -- Mike having number 456
INSERT INTO phone_number (id, contact_id, phone) VALUES (11, 6, '123'); -- Mike the second having number 456
INSERT INTO phone_number (id, contact_id, phone) VALUES (12, 6, '789'); -- Mike the second having number 456
-- Mike the third having no number
COMMIT;
-- does not meet the requirements described in the question - will return Peter when it should not
SELECT DISTINCT c.name
FROM contact c JOIN phone_number pn ON (pn.contact_id = c.id)
GROUP BY name, phone_number
HAVING COUNT(c.id) > 1
;
-- returns correct results for provided test data
-- take all people that have a namesake in contact table and
-- take all this person's phone numbers that this person's namesake also has
-- finally (outer query) check that the number of both persons' phone numbers is the same and
-- the number of the same phone numbers is equal to the number of (either) person's phone numbers
SELECT c1_id, name
FROM (
SELECT c1.id AS c1_id, c1.name, c2.id AS c2_id, COUNT(1) AS cnt
FROM contact c1
JOIN contact c2 ON (c2.id != c1.id AND c2.name = c1.name)
JOIN phone_number pn ON (pn.contact_id = c1.id)
WHERE
EXISTS (SELECT 1
FROM phone_number
WHERE contact_id = c2.id
AND phone = pn.phone)
GROUP BY c1.id, c1.name, c2.id
)
WHERE cnt = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id)
AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c2_id)
;
-- cleanup
DROP TABLE phone_number;
DROP TABLE contact;
检查SQL Fiddle:http://www.sqlfiddle.com/#!4/36cdf/1
<强>被修改强>
回答作者的评论:我当然没有考虑到这一点......这是一个经过修改的解决方案:
-- new test data
INSERT INTO contact (id, name) VALUES (8, 'Jane');
INSERT INTO contact (id, name) VALUES (9, 'Jane');
SELECT c1_id, name
FROM (
SELECT c1.id AS c1_id, c1.name, c2.id AS c2_id, COUNT(1) AS cnt
FROM contact c1
JOIN contact c2 ON (c2.id != c1.id AND c2.name = c1.name)
LEFT JOIN phone_number pn ON (pn.contact_id = c1.id)
WHERE pn.contact_id IS NULL
OR EXISTS (SELECT 1
FROM phone_number
WHERE contact_id = c2.id
AND phone = pn.phone)
GROUP BY c1.id, c1.name, c2.id
)
WHERE (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) IN (0, cnt)
AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c2_id)
;
我们允许没有电话号码的情况(LEFT JOIN),在外部查询中我们现在比较人的电话号码的数量 - 它必须等于0,或者从内部查询返回的数字。