MySQL: matching "same" email addresses

Date: 2018-07-25 19:15:33

Tags: mysql sql regex self-join

I have a table with 2 columns, email and id. I need to find emails that are closely related. For example:

john.smith12@example.com

john.smith12@some.subdomains.example.com

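These should be treated as the same, because the username (john.smith12) and the top-most domain (example.com) are identical. They are currently 2 different rows in my table. I wrote the expression below, which should do the comparison, but it takes hours to execute (possibly because of the regex). Is there a better way to write this: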

  select c1.email, c2.email 
  from table as c1
  join table as c2
   on (
             c1.leadid <> c2.leadid 
        and 
             c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))

The EXPLAIN for this query is:

id, select_type, table, type, possible_keys, key, key_len, ref,  rows,   Extra
1,  SIMPLE,      c1,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, NULL
1,  SIMPLE,      c2,    ALL,   NULL,         NULL,  NULL,  NULL, 577532, Using where; Using join buffer (Block Nested Loop)

The CREATE TABLE is:

CREATE TABLE `table` (
 `ID` int(11) NOT NULL AUTO_INCREMENT,
 `Email` varchar(100) DEFAULT NULL,
 KEY `Table_Email` (`Email`),
 KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

I guess the indexes are not being used because of the regexp.

The regex comes out as:

john[.]smith12@[^@]*example[.]com

which should match both addresses.
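
To see what the nested replace() calls produce, here is a minimal Python sketch that uses the `re` module as a stand-in for MySQL's REGEXP. The helper name `related_pattern` and the last two test addresses are made up for illustration; only the first two addresses come from the question.

```python
import re

def related_pattern(email: str) -> str:
    """Mirror replace(replace(email, '.', '[.]'), '@', '@[^@]*') from the query."""
    return email.replace(".", "[.]").replace("@", "@[^@]*")

pattern = related_pattern("john.smith12@example.com")

candidates = [
    "john.smith12@example.com",
    "john.smith12@some.subdomains.example.com",
    "john.smith12@example.org",      # hypothetical: different domain
    "john.smith12@notexample.com",   # hypothetical: domain only ends with the same text
]

# re.search is unanchored, like MySQL's REGEXP.
matches = {address: bool(re.search(pattern, address)) for address in candidates}

print(pattern)
for address, matched in matches.items():
    print(matched, address)
```

Under these assumptions the last address matches as well, because `[^@]*` can absorb `not`: the pattern does not require a dot in front of `example`.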

Update

I changed the ON clause to:

on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid and substr(c1. email, 1, (locate('@', c1.email) -1)) = substr(c2. email, 1, (locate('@', c2.email) -1))
and    
substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))

and with this approach the EXPLAIN at least uses the indexes.

id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)

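So far this has been running for 5 minutes; I will update if there is a big improvement.

Update 2:

I have split the email so that the username is one column and the domain is another. I store the domain in reverse order so that an index on it can be used with a trailing wildcard.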

CREATE TABLE `table` (
     `ID` int(11) NOT NULL AUTO_INCREMENT,
     `Email` varchar(100) DEFAULT NULL,
     `domain` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
     `username` varchar(500) CHARACTER SET utf8 DEFAULT NULL,
     KEY `Table_Email` (`Email`),
     KEY `Email` (`Email`),
     KEY `domain` (`domain`)
    ) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1

Query to populate the new columns:

update table
set username = trim(SUBSTRING_INDEX(trim(email), '@', 1)), 
domain = reverse(trim(SUBSTRING_INDEX(SUBSTRING_INDEX(trim(email), '@', -1), '.', -3)));
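
As a rough illustration of what this UPDATE stores, the following Python sketch re-implements the same steps. `substring_index` is a hand-written stand-in for MySQL's SUBSTRING_INDEX, and `split_email` is a hypothetical helper name; neither exists in MySQL or in the question.

```python
from typing import Tuple

def substring_index(text: str, delim: str, count: int) -> str:
    """Python stand-in for MySQL's SUBSTRING_INDEX(text, delim, count)."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything right of the count-th delimiter from the end
    return ""

def split_email(email: str) -> Tuple[str, str]:
    """Return (username, reversed domain), keeping at most the last three labels."""
    email = email.strip()
    username = substring_index(email, "@", 1).strip()
    domain = substring_index(substring_index(email, "@", -1), ".", -3).strip()
    return username, domain[::-1]

short = split_email("john.smith12@example.com")
long_ = split_email("john.smith12@some.subdomains.example.com")

# domain LIKE CONCAT(other.domain, '%') is a prefix test on the reversed strings.
# (LIKE would also treat '%' and '_' inside the value as wildcards; ignored here.)
related = long_[0] == short[0] and long_[1].startswith(short[1])

print(short, long_, related)
```

Note that the prefix test is directional: the row with the longer reversed domain has to be on the left-hand side of the LIKE.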

New query:

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
join table as c2
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

New EXPLAIN results:

1, SIMPLE, c1, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, table_Email,Email, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

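From this EXPLAIN it looks like the domain index is not being used. I also tried to force it with USE INDEX, but that did not work either and resulted in no index being used at all: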

select c1.email, c2.email, c2.domain, c1.domain, c1.username, c2.username, c1.leadid, c2.leadid
from table as c1
USE INDEX (domain)
join table as c2
USE INDEX (domain)
on (c1.email is not null and c2.email is not null and c1.leadid <> c2.leadid
    and c1.username = c2.username and c1.domain like concat(c2.domain, '%'))

EXPLAIN with USE INDEX:

1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 649173, Using where
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 649173, Using where; Using join buffer (Block Nested Loop)

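4 answers:

Answer 0 (score: 2)

You told us that the table has 700K rows.

That is not a lot, but you are joining it to itself, so in the worst case the engine has to process 700K * 700K = 490 000 000 000 = 490B rows.

An index can definitely help here.

The best index depends on the data distribution.

What does the following query return?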

SELECT COUNT(DISTINCT username) 
FROM table

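If the result is of the same order as 700K, say 100K, it means there are many different usernames, so you are better off focusing on username rather than domain. If the result is low, say 100, then an index on username is unlikely to be useful.

I expect there to be many different usernames, so I would create an index on username, since the query joins on that column with a simple equality comparison and that join would benefit greatly from the index.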
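
To make the arithmetic concrete, here is a small Python sketch with made-up rows (the ids, usernames and domains below are illustrative, not the question's data). It compares the number of row pairs a plain nested loop examines with the number examined when rows are first bucketed by username, which is roughly what an equality lookup through an index on username buys you.

```python
from collections import defaultdict

# Worst case from the answer: every row joined against every row.
worst_case = 700_000 * 700_000

# Hypothetical rows (id, username, domain).
rows = [
    (1, "john.smith12", "example.com"),
    (2, "john.smith12", "some.subdomains.example.com"),
    (3, "john.doe", "another.domain.eu"),
    (4, "nathan.smith", "example.domain.com"),
    (5, "john.doe", "example.com"),
]

# Nested loop without a usable index: n * n row pairs.
nested_loop = len(rows) ** 2

# Equality lookup on username: a row only meets rows with the same username.
buckets = defaultdict(list)
for row in rows:
    buckets[row[1]].append(row)
bucketed = sum(len(group) ** 2 for group in buckets.values())

print(worst_case, nested_loop, bucketed)
```

The more distinct usernames there are, the smaller each bucket gets, which is why the COUNT(DISTINCT username) result matters.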

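Another option to consider is a composite index on (username, domain), or even a covering index on (username, domain, leadid, email). The order of the columns in the index definition matters.

I would drop all other indexes, so that the optimizer has no other choices, unless there are other queries that need them.

Defining a primary key on the table most likely would not hurt either.

There is one more, less important thing to consider. Does your data really contain NULLs? If not, define the columns as NOT NULL. Also, in many cases it is better to use empty strings rather than NULLs, unless you have very specific requirements and have to distinguish between NULL and ''.

The query would then be slightly simpler: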

select 
    c1.email, c2.email, 
    c1.domain, c2.domain, 
    c1.username, c2.username, 
    c1.leadid, c2.leadid
from 
    table as c1
    join table as c2
        on  c1.username = c2.username 
        and c1.domain like concat(c2.domain, '%')
        and c1.leadid <> c2.leadid

Answer 1 (score: 1)

No REGEXP_REPLACE is needed, so this works in all versions of MySQL / MariaDB:

UPDATE tbl
    SET email = CONCAT(SUBSTRING_INDEX(email, '@', 1),
                       '@',
                       SUBSTRING_INDEX(
                           SUBSTRING_INDEX(email, '@', -1),
                           '.',
                           -2));

Since no index would be useful here, you may as well not bother with a WHERE clause.

Answer 2 (score: 0)

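If you are searching for related data, you should look at data mining tools or Elasticsearch, for example, which work the way you need.

I have another possible "database only" solution, but I don't know whether it would work or whether it is the best one. If I had to do this, I would try to build a "word reference" table by splitting all the emails on every non-alphanumeric character.

In your example, this table would be filled with: john, smith12, some, subdomains, example and com. Each word gets a unique id. Then a second table, the union table, links each email to its own words. Indexes are needed on both tables.

To search for closely related emails, you would split the source email with a regex and loop over each sub-word (for instance with CONNECT BY), then look each word up in the word reference table, and then go through the union table to find the emails that contain it.

For this request, you can select all the matching emails, group by email to count how many words each one shares with the source email, and keep only the best matches (excluding the original email, of course).

Sorry for this "not sure" answer, but it was too long for a comment. I will try to put an example together.

Here is an example with some data (written for Oracle; the idea should carry over to MySQL, although Oracle-specific syntax such as CONNECT BY and sequences would need adapting):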

---------------------------------------------
-- Table containing emails and people info
CREATE TABLE PEOPLE (
     ID NUMBER(11) PRIMARY KEY NOT NULL,
     EMAIL varchar2(100) DEFAULT NULL,
     USERNAME varchar2(500) DEFAULT NULL
);

-- Table containing word references
CREATE TABLE WORD_REF (
     ID number(11) NOT NULL PRIMARY KEY,
     WORD varchar2(20) DEFAULT NULL
);

-- Table containing the ids of both previous tables
CREATE TABLE UNION_TABLE (
     EMAIL_ID number(11) NOT NULL,
     WORD_ID number(11) NOT NULL,
     CONSTRAINT EMAIL_FK FOREIGN KEY (EMAIL_ID) REFERENCES PEOPLE (ID),
     CONSTRAINT WORD_FK FOREIGN KEY (WORD_ID) REFERENCES WORD_REF (ID)
);

-- Here is my oracle sequence to simulate the auto increment
CREATE SEQUENCE MY_SEQ
  MINVALUE 1
  MAXVALUE 999999
  START WITH 1
  INCREMENT BY 1
  CACHE 20;

---------------------------------------------
-- Some data in the people table
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@example.com', 'jsmith12');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.smith12@some.subdomains.example.com', 'admin');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'john.doe@another.domain.eu', 'jdo');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'nathan.smith@example.domain.com', 'nsmith');
INSERT INTO PEOPLE (ID, EMAIL, USERNAME) VALUES (MY_SEQ.NEXTVAL, 'david.cayne@some.domain.st', 'davidcayne');
COMMIT;

-- Word reference data from the people data
INSERT INTO WORD_REF (ID, WORD) 
  (select MY_SEQ.NEXTVAL, WORD FROM
   (select distinct REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
    from PEOPLE
    CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL
  ));
COMMIT;

-- Union table filling
INSERT INTO UNION_TABLE (EMAIL_ID, WORD_ID)
select words.ID EMAIL_ID, word_ref.ID WORD_ID
FROM 
(select distinct ID, REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) WORD
 from PEOPLE
 CONNECT BY REGEXP_SUBSTR(EMAIL, '\w+',1,LEVEL) IS NOT NULL) words
left join WORD_REF on word_ref.word = words.WORD;
COMMIT;    

---------------------------------------------
-- Finally, the request which ranks the emails that match the source email 'john.smith12@example.com'
SELECT COUNT(1) email_match
      ,email
FROM   (SELECT word_ref.id
              ,words.word
              ,uni.email_id
              ,ppl.email
        FROM   (SELECT DISTINCT regexp_substr('john.smith12@example.com'
                                             ,'\w+'
                                             ,1
                                             ,LEVEL) word
                FROM   dual
                CONNECT BY regexp_substr('john.smith12@example.com'
                                        ,'\w+'
                                        ,1
                                        ,LEVEL) IS NOT NULL) words
        LEFT   JOIN word_ref
        ON     word_ref.word = words.word
        LEFT   JOIN union_table uni
        ON     uni.word_id = word_ref.id
        LEFT   JOIN people ppl
        ON     ppl.id = uni.email_id)
WHERE  email <> 'john.smith12@example.com'
GROUP  BY email
ORDER  BY email_match DESC;

Result of the request:

    4    john.smith12@some.subdomains.example.com
    2    nathan.smith@example.domain.com
    1    john.doe@another.domain.eu
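
The same scoring can be sketched outside the database. The Python snippet below is only an illustration of the word-overlap idea, not a replacement for the tables above; it uses the five sample addresses from the INSERT statements, and the helper name `words` is made up.

```python
import re
from collections import Counter

emails = [
    "john.smith12@example.com",
    "john.smith12@some.subdomains.example.com",
    "john.doe@another.domain.eu",
    "nathan.smith@example.domain.com",
    "david.cayne@some.domain.st",
]
source = "john.smith12@example.com"

def words(email: str) -> set:
    """Distinct runs of word characters, like REGEXP_SUBSTR(email, '\\w+', 1, LEVEL)."""
    return set(re.findall(r"\w+", email))

source_words = words(source)

scores = Counter()
for email in emails:
    if email == source:
        continue  # exclude the source email itself
    shared = len(source_words & words(email))
    if shared:    # emails sharing no word never appear in the SQL result either
        scores[email] = shared

ranking = scores.most_common()
for email, score in ranking:
    print(score, email)
```

With this data the ranking agrees with the request result shown above.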

Answer 3 (score: 0)

You get the name (i.e. the part before the '@') with

substring_index(email, '@', 1)

You get the domain with

substring_index(replace(email, '@', '.'), '.', -2)

(because if we replace the '@' with a dot, the domain is always the part after the second-to-last dot).
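
A quick way to check these two expressions is to replay them in Python. In the sketch below, `substring_index` is a hand-written stand-in for the MySQL function, `name_and_domain` is a hypothetical helper, and the `co.uk` address is an invented example.

```python
def substring_index(text: str, delim: str, count: int) -> str:
    """Python stand-in for MySQL's SUBSTRING_INDEX(text, delim, count)."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

def name_and_domain(email: str):
    name = substring_index(email, "@", 1)
    # Turning '@' into '.' makes it act as one more label separator.
    domain = substring_index(email.replace("@", "."), ".", -2)
    return name, domain

plain = name_and_domain("john.smith12@example.com")
nested = name_and_domain("john.smith12@some.subdomains.example.com")

# Without the replace, the last two dot-separated pieces still contain the name.
without_replace = substring_index("john@example.com", ".", -2)

# Hypothetical address under a two-label public suffix.
uk = name_and_domain("jane@shop.example.co.uk")

print(plain, nested, without_replace, uk)
```

One caveat this makes visible: for suffixes such as `co.uk`, the last two labels are only the suffix, so unrelated organisations would be grouped together whenever the names match.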

So you find the duplicates with

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and substring_index(other.email, '@', 1) = 
        substring_index(users.email, '@', 1)
    and substring_index(replace(other.email, '@', '.'), '.', -2) =
        substring_index(replace(users.email, '@', '.'), '.', -2)
);

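If this is too slow, you may want to create a computed column on the combination of the two and index it:

-- MySQL 5.7+ generated column: a data type and parentheses around the expression are required.
-- varchar(100) assumes email is at most 100 characters, as in the question's table.
alter table users add main_email varchar(100) as 
  (concat(substring_index(email, '@', 1), '@', substring_index(replace(email, '@', '.'), '.', -2)));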

create index idx on users(main_email);

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and other.main_email = users.main_email
);
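
Conceptually, indexing `main_email` lets the server find all rows with the same key directly instead of comparing every pair. A rough Python analogue is grouping rows by the computed key; the row ids below are hypothetical and `substring_index` is again a hand-written stand-in for the MySQL function.

```python
from collections import defaultdict

def substring_index(text: str, delim: str, count: int) -> str:
    """Python stand-in for MySQL's SUBSTRING_INDEX(text, delim, count)."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

def main_email(email: str) -> str:
    """Same expression as the main_email computed column."""
    name = substring_index(email, "@", 1)
    domain = substring_index(email.replace("@", "."), ".", -2)
    return name + "@" + domain

# Hypothetical (id, email) rows.
users = [
    (1, "john.smith12@example.com"),
    (2, "john.smith12@some.subdomains.example.com"),
    (3, "john.doe@another.domain.eu"),
]

# One pass over the rows, grouping ids by the computed key.
groups = defaultdict(list)
for user_id, email in users:
    groups[main_email(email)].append(user_id)

# Keys shared by more than one row correspond to the EXISTS query's matches.
duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(duplicates)
```

The key is case-sensitive here, just like the expression it mirrors.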

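Of course, you can also keep the two parts separate and index them:

-- MySQL 5.7+ generated columns; varchar(100) assumes email is at most 100 characters.
alter table users add email_name varchar(100) as (substring_index(email, '@', 1));
alter table users add email_domain varchar(100) as (substring_index(replace(email, '@', '.'), '.', -2));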

create index idx on users(email_name, email_domain);

select *
from users
where exists
(
  select *
  from users other
  where other.id <> users.id
    and other.email_name = users.email_name
    and other.email_domain = users.email_domain
);

Of course, if your email address column contains both upper and lower case, you also need to apply LOWER in the expressions above, i.e. use lower(email).