MySQL: efficiently mark duplicates in a large table

Time: 2017-08-10 13:10:11

Tags: mysql duplicates sql-update

I am working on a large table that receives data from many events. Within a single event, I want to check a specific column (text or varchar) for duplicates and set the column duplicity to 1 in every row whose value occurs more than once. The table currently holds over 100 000 rows, about 30 000 of which belong to the event in question, so any joins with subqueries take minutes to finish.

Here is what I have come up with so far. It works, but it still takes several seconds to finish, and I would like to learn a more efficient solution. It also feels too bulky and ugly for such a relatively simple task.

-- Rows belonging to the event in question
DROP TEMPORARY TABLE IF EXISTS table2;
CREATE TEMPORARY TABLE table2 AS (SELECT * FROM `table` WHERE ide = 123);

-- Values of text_column that occur more than once within the event
DROP TEMPORARY TABLE IF EXISTS table3;
CREATE TEMPORARY TABLE table3 AS (SELECT text_column FROM `table`
    WHERE ide = 123
    GROUP BY text_column
    HAVING COUNT(*) > 1);

-- Flag every row of the event whose text_column value is duplicated
UPDATE (
    SELECT ev.id AS id FROM table3 txt
    INNER JOIN table2 ev ON ev.text_column = txt.text_column
) a
INNER JOIN `table` main ON main.id = a.id
SET main.duplicity = 1;

This currently takes about 8 seconds, and I expect the amount of data in the event to at least triple shortly.

I cannot modify the existing database or table structure.

My previous approach was nicer, but it took about 4 minutes on the current data set:

-- Join the table against the duplicated values found within the event
UPDATE `table` t1
JOIN (
  SELECT text_column FROM `table`
    WHERE ide = 123
    GROUP BY text_column
    HAVING COUNT(*) > 1) t2
ON t1.text_column = t2.text_column
SET t1.duplicity = 1;
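
One direction that might be worth trying, offered only as a sketch: both queries above join on text_column against a derived or temporary table that has no index. Since temporary tables are already in use and do not change the existing schema, the duplicated values could be stored in a temporary table with its own index before the update runs. The names dup_values and idx_text are hypothetical, and the prefix length of 64 is an assumption: it must not exceed the declared length of text_column, and for a short VARCHAR the prefix can be dropped altogether. Whether this actually beats the 8 seconds depends on the data and the MySQL version, so it would need to be measured.

```sql
-- Sketch under stated assumptions: `table`, ide, text_column and duplicity
-- come from the question; dup_values and idx_text are hypothetical names.
DROP TEMPORARY TABLE IF EXISTS dup_values;

-- Collect each value that occurs more than once within the event.
-- Creating the table from the SELECT keeps the column's type and collation.
CREATE TEMPORARY TABLE dup_values AS
SELECT text_column
FROM `table`
WHERE ide = 123
GROUP BY text_column
HAVING COUNT(*) > 1;

-- Index the temporary table so the join below can look values up.
-- The prefix length (64) is an assumption; adjust it to the real column.
ALTER TABLE dup_values ADD INDEX idx_text (text_column(64));

-- Flag only the rows of this event whose value is in the duplicate list.
UPDATE `table` AS main
INNER JOIN dup_values AS d ON d.text_column = main.text_column
SET main.duplicity = 1
WHERE main.ide = 123;
```

Unlike the previous approach, the final UPDATE here is restricted to ide = 123, so rows of other events that happen to share a value are left untouched, which matches the stated goal of checking within a single event.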

0 answers:

No answers yet