Question

I have a PostgreSQL users table in which one of the columns is email. Some users have mistakenly entered the same email twice, in forms that differ only by whitespace, e.g. "me@example.com" and " me@example.com". I would like to delete all the rows whose email contains such extra whitespace.

Here is the query I want to use:

DELETE FROM ONLY users
WHERE id IN (
  SELECT users1.id
  FROM users AS users1, users AS users2
  WHERE users1.id != users2.id
      AND trim(both from users1.email) = users2.email)

Unfortunately this query is very slow (O(n^2) I believe because of the cross-join), and I really need a way to speed it up so we don't bog down our database.

Answer 1

To begin with, you could use a correlated subquery. However, I think you should approach this using window functions:

DELETE FROM ONLY users u
    USING (SELECT u2.id,
                  ROW_NUMBER() OVER (PARTITION BY trim(both from u2.email) ORDER BY LENGTH(u2.email) ASC) as seqnum
           FROM users u2
          ) d
    WHERE d.id = u.id AND d.seqnum > 1;

This keeps the shortest email in each group of emails that are equivalent modulo leading and trailing spaces, and deletes the rest.

(Note: test this out on sample data before running it on a big table.)
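
The correlated subquery mentioned at the start of this answer is not spelled out. A minimal sketch, assuming only the id and email columns from the question, might look like the following. It behaves differently from the window-function query: a padded row is removed only when some other row already stores the trimmed address, so exact duplicates and pairs of differently padded emails are left alone. It is likely to perform best when email is indexed, which is an assumption about the schema.

```sql
-- Sketch: delete a padded row only if a clean copy of its email exists elsewhere.
DELETE FROM ONLY users u
WHERE u.email <> trim(both from u.email)          -- this row's email has leading/trailing spaces
  AND EXISTS (
        SELECT 1
        FROM users u2
        WHERE u2.id <> u.id
          AND u2.email = trim(both from u.email)  -- another row holds the trimmed form
      );
```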

Answer 2

Use a self join to find users who have more than one stored email, where one copy contains extra characters (I assume the extra characters are only spaces). Then delete those rows from the original table.

delete from users
where (id, email) in (select u1.id, u1.email
                      from users u1
                      -- match different rows whose emails are equal once trimmed
                      join users u2 on trim(u1.email) = trim(u2.email)
                                   and u1.id <> u2.id
                      where char_length(u1.email) - char_length(u2.email) >= 1);

Edit: The simplest way is to delete every row whose email has leading or trailing spaces, whether or not a clean copy exists:

delete from users
where length(trim(email)) <> length(email);
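
Because this statement removes every padded email regardless of whether a matching clean row exists, it may be worth previewing the affected rows first. The following is a rough sketch of a dry run inside a transaction, using only the table and columns from the question. Note that trim() with no character argument strips spaces only, so emails padded with tabs or newlines would not be matched.

```sql
BEGIN;

-- Preview the rows that the DELETE below would remove.
SELECT id, email
FROM users
WHERE length(trim(email)) <> length(email);

DELETE FROM users
WHERE length(trim(email)) <> length(email);

-- Inspect the table, then replace ROLLBACK with COMMIT to keep the change.
ROLLBACK;
```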

