Remove duplicates using only a MySQL query?

Time: 2010-08-01 21:55:48

Tags: sql mysql

I have a table with the following columns:

URL_ID    
URL_ADDR    
URL_Time

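I want to remove the duplicates on the URL_ADDR column using a MySQL query.

Is it possible to do this without any programming?

7 answers:

Answer 0: (score: 31)

Consider the following test case: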

CREATE TABLE mytb (url_id int, url_addr varchar(100));

INSERT INTO mytb VALUES (1, 'www.google.com');
INSERT INTO mytb VALUES (2, 'www.microsoft.com');
INSERT INTO mytb VALUES (3, 'www.apple.com');
INSERT INTO mytb VALUES (4, 'www.google.com');
INSERT INTO mytb VALUES (5, 'www.cnn.com');
INSERT INTO mytb VALUES (6, 'www.apple.com');

Our test table now contains:

SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      4 | www.google.com    |
|      5 | www.cnn.com       |
|      6 | www.apple.com     |
+--------+-------------------+
6 rows in set (0.00 sec)

Then we can use the multiple-table DELETE syntax as follows:

DELETE t2
FROM   mytb t1
JOIN   mytb t2 ON (t2.url_addr = t1.url_addr AND t2.url_id > t1.url_id);

...which will delete the duplicate entries, leaving only the first row for each URL based on url_id:

SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      5 | www.cnn.com       |
+--------+-------------------+
4 rows in set (0.00 sec)

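UPDATE - Further to the new comments above:

If the duplicate URLs do not all have the same format, you may want to apply the REPLACE() function to strip the www. or http:// parts. For example: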

DELETE t2
FROM   mytb t1
JOIN   mytb t2 ON (REPLACE(t2.url_addr, 'www.', '') = 
                   REPLACE(t1.url_addr, 'www.', '') AND 
                   t2.url_id > t1.url_id);
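
If both prefixes need to be ignored, the REPLACE() calls can be nested. The following is only a sketch against the mytb test table above. It assumes the addresses differ only by an http:// and/or www. prefix, and note that REPLACE() removes every occurrence of the substring, not just a leading one. Running the SELECT first shows which rows the DELETE would remove.

```sql
-- Preview the rows that would be deleted (read-only).
SELECT DISTINCT t2.url_id, t2.url_addr
FROM   mytb t1
JOIN   mytb t2 ON (REPLACE(REPLACE(t2.url_addr, 'http://', ''), 'www.', '') =
                   REPLACE(REPLACE(t1.url_addr, 'http://', ''), 'www.', '') AND
                   t2.url_id > t1.url_id);

-- Same join condition, this time deleting the later duplicates.
DELETE t2
FROM   mytb t1
JOIN   mytb t2 ON (REPLACE(REPLACE(t2.url_addr, 'http://', ''), 'www.', '') =
                   REPLACE(REPLACE(t1.url_addr, 'http://', ''), 'www.', '') AND
                   t2.url_id > t1.url_id);
```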

Answer 1: (score: 8)

You may want to try the approach described at http://labs.creativecommons.org/2010/01/12/removing-duplicate-rows-in-mysql/:

ALTER IGNORE TABLE your_table ADD UNIQUE INDEX `tmp_index` (URL_ADDR);
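
Note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so the statement above only works on older servers. On newer versions, a roughly equivalent result can be sketched with a copy of the table and INSERT IGNORE. The names your_table_dedup and your_table_old are hypothetical, and the sketch assumes URL_ADDR is short enough to be indexed in full.

```sql
-- Assumption: your_table_dedup and your_table_old are made-up names.
-- 1. Empty copy of the table with the same columns and indexes.
CREATE TABLE your_table_dedup LIKE your_table;

-- 2. Unique index so that a second row with the same URL_ADDR is rejected.
ALTER TABLE your_table_dedup ADD UNIQUE INDEX tmp_index (URL_ADDR);

-- 3. Copy the rows; IGNORE skips rows that would violate the unique index.
--    Ordering by URL_ID is intended to keep the lowest URL_ID per address.
INSERT IGNORE INTO your_table_dedup
SELECT * FROM your_table ORDER BY URL_ID;

-- 4. Swap the tables; the original is kept as your_table_old until checked.
RENAME TABLE your_table TO your_table_old,
             your_table_dedup TO your_table;
```

Keeping the old table around until the result has been checked makes the operation easy to undo.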

Answer 2: (score: 5)

This will keep the row with the highest URL_ID for each URL_ADDR:

DELETE FROM your_table
WHERE URL_ID NOT IN 
    (SELECT ID FROM 
       (SELECT MAX(URL_ID) AS ID 
        FROM your_table 
        WHERE URL_ID IS NOT NULL
        GROUP BY URL_ADDR ) X)   /* Sounds like you would need to GROUP BY a
                                    calculated form, e.g. using REPLACE to
                                    strip out www; see Daniel's answer */

(The derived table 'X' is there to avoid the error "You can't specify target table 'tablename' for update in FROM clause".)
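
The "calculated form" mentioned in the comment could look roughly like this. It is a sketch only: your_table is a placeholder name, and stripping 'www.' with REPLACE() is just one possible normalisation.

```sql
-- Keep the highest URL_ID per normalised address, delete everything else.
DELETE FROM your_table
WHERE URL_ID NOT IN
    (SELECT ID FROM
       (SELECT MAX(URL_ID) AS ID
        FROM your_table
        WHERE URL_ID IS NOT NULL
        -- Group on the address with 'www.' removed instead of the raw value.
        GROUP BY REPLACE(URL_ADDR, 'www.', '')) X);
```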

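Answer 3: (score: 3)

Well, you could always:

  1. create a temporary table;
  2. INSERT INTO ... SELECT DISTINCT from the original table into the temporary table;
  3. clear the original table;
  4. INSERT INTO ... SELECT from the temporary table into the original table;
  5. drop the temporary table.

It's clumsy and awkward, and it needs several queries (not to mention privileges), but it will do the trick if you don't find another solution.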
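
The steps above can be sketched as follows, using the mytb test table from answer 0. The name tmp_mytb is hypothetical. Because url_id differs between duplicate rows, a plain SELECT DISTINCT over all columns would not collapse them, so this sketch groups by url_addr and keeps the lowest url_id instead.

```sql
-- Assumption: mytb(url_id, url_addr) as in answer 0; tmp_mytb is a made-up name.
-- 1-2. Create the temporary table holding one row per address.
CREATE TEMPORARY TABLE tmp_mytb AS
SELECT MIN(url_id) AS url_id, url_addr
FROM   mytb
GROUP  BY url_addr;

-- 3. Clear the original table.
DELETE FROM mytb;

-- 4. Copy the de-duplicated rows back.
INSERT INTO mytb (url_id, url_addr)
SELECT url_id, url_addr
FROM   tmp_mytb;

-- 5. Drop the temporary table.
DROP TEMPORARY TABLE tmp_mytb;
```

On a transactional engine, wrapping steps 3 and 4 in a transaction would avoid leaving the table empty if the copy fails.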

Answer 4: (score: 1)

Daniel Vassallo, how would this work with multiple columns?

DELETE t2 FROM directory1 t1 JOIN directory1 t2 ON (t2.page = t1.page AND t2.parentTopic = t1.parentTopic AND t2.title = t1.title AND t2.description = t1.description AND t2.linktype = t1.linktype AND t2.priority = t1.priority AND t2.linkID > t1.linkID);

Maybe something like this?

Answer 5: (score: 0)

You can group on URL_ADDR, which effectively gives you only the distinct values of the URL_ADDR field.

select 
 URL_ID,
 URL_ADDR,
 URL_Time
from
 some_table
group by
 URL_ADDR

Enjoy!
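
Note that this query only selects rows and does not delete anything. Also, with the ONLY_FULL_GROUP_BY SQL mode (enabled by default since MySQL 5.7.5), selecting columns that are neither grouped nor aggregated is rejected. A sketch that works under that mode, assuming it is acceptable for the ID and the time to come from different duplicate rows:

```sql
-- One row per address; MIN() picks the lowest ID and the earliest time,
-- which are not guaranteed to belong to the same original row.
SELECT MIN(URL_ID)   AS URL_ID,
       URL_ADDR,
       MIN(URL_Time) AS URL_Time
FROM   some_table
GROUP  BY URL_ADDR;
```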

Answer 6: (score: 0)

This will work provided that your URL_ID column is unique.

DELETE FROM url WHERE URL_ID IN (
SELECT URL_ID
FROM url a INNER JOIN (
    SELECT URL_ADDR, MAX(URL_ID) MaxURLId 
    FROM url
    GROUP BY URL_ADDR
    HAVING COUNT(*) > 1) b ON a.URL_ID <> b.MaxURLId AND a.URL_ADDR = b.URL_ADDR
)