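I have a table with a column that contains a list of emails. Among emails that are similar to one another, I need to keep only the most recent one. For example, if the table contains the following rows: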
+--------------------------------+------------------------+
| Email                          | Received at            |
+--------------------------------+------------------------+
| aespinola@aaa.com              | 2016-08-04 20:56:53+00 |
| aespinola@aaa.co               | 2016-08-04 20:56:52+00 |
| aespinola@aaa                  | 2016-08-04 20:56:51+00 |
| tracy-lee.danie@lobsterink.com | 2016-08-04 10:56:53+00 |
| trac@lobsterink.com            | 2016-08-04 10:56:52+00 |
| accounts@abc.com               | 2016-08-04 06:57:32+00 |
| accounts@abc.com.au            | 2016-08-04 06:57:46+00 |
| Mahendra.chouhan@xyz.com       | 2016-08-04 13:54:42+00 |
+--------------------------------+------------------------+
The final output should look like this:
+--------------------------------+------------------------+
| Email                          | Received at            |
+--------------------------------+------------------------+
| aespinola@aaa.com              | 2016-08-04 20:56:53+00 |
| tracy-lee.danie@lobsterink.com | 2016-08-04 10:56:53+00 |
| accounts@abc.com.au            | 2016-08-04 06:57:46+00 |
| Mahendra.chouhan@xyz.com       | 2016-08-04 13:54:42+00 |
+--------------------------------+------------------------+
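Using the link below, I was able to find which emails are similar to each other. Grouping them is the next step, and I cannot figure out how to do it.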
Finding similar strings with PostgreSQL quickly
Update: As requested in the comments, I have added the code that computes the similarity between emails:
CREATE EXTENSION pg_trgm;

DROP TABLE IF EXISTS roshan_email_list;

CREATE TEMPORARY TABLE roshan_email_list AS (
    SELECT EXTRACT(MONTH FROM received_at) AS month,
           EXTRACT(YEAR FROM received_at) AS year,
           email
    FROM users
    GROUP BY month, year, email
);

CREATE INDEX roshan_email_list_gist ON roshan_email_list
    USING gist (email gist_trgm_ops);

SELECT set_limit(0.75);

-- The below query gives the similarity between emails
WITH email_similarity AS (
    SELECT similarity(n1.email, n2.email) AS sim,
           n1.email AS email,
           n2.email AS similar_email,
           n1.month,
           n1.year
    FROM roshan_email_list n1
    JOIN roshan_email_list n2
      ON n1.email <> n2.email
     AND n1.email % n2.email
     AND n1.month = n2.month
     AND n1.year = n2.year
    WHERE n1.year = 2016
    ORDER BY sim DESC
)
SELECT e.sim, e.email, u.received_at,
       e.similar_email, e.month, e.year
FROM email_similarity e
INNER JOIN callinize.users u ON e.email = u.email;
Answer 0 (score: 0)
I'm not sure about your whole data set.
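-- Split each email into the part before '@' and the first label of the domain
with data as (
    select
        split_part(email, '@', 1) as first,
        split_part(split_part(email, '@', 2), '.', 1) as second,
        received_at,
        email
    from emails
),
-- Number the rows of each (first, second) group, newest first
ndata as (
    select *,
        row_number() over (partition by first, second order by received_at desc)
    from data
)
-- Keep only the newest row of each group
select
    email, received_at
from ndata
where row_number = 1;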
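To see how this grouping behaves on the rows from the question, here is a minimal self-contained sketch of the same idea. It assumes PostgreSQL; the inline VALUES list stands in for the emails table, and the names local_part, domain_label, and rn are illustrative choices, not names from the question.

```sql
-- Sample rows from the question, inlined so the query runs without any table.
WITH emails (email, received_at) AS (
    VALUES
        ('aespinola@aaa.com',              '2016-08-04 20:56:53+00'::timestamptz),
        ('aespinola@aaa.co',               '2016-08-04 20:56:52+00'),
        ('aespinola@aaa',                  '2016-08-04 20:56:51+00'),
        ('tracy-lee.danie@lobsterink.com', '2016-08-04 10:56:53+00'),
        ('trac@lobsterink.com',            '2016-08-04 10:56:52+00'),
        ('accounts@abc.com',               '2016-08-04 06:57:32+00'),
        ('accounts@abc.com.au',            '2016-08-04 06:57:46+00'),
        ('Mahendra.chouhan@xyz.com',       '2016-08-04 13:54:42+00')
),
-- Grouping key: the text before '@' plus the first label of the domain.
keyed AS (
    SELECT email,
           received_at,
           split_part(email, '@', 1) AS local_part,
           split_part(split_part(email, '@', 2), '.', 1) AS domain_label
    FROM emails
),
-- Rank the rows inside each group, most recent first.
ranked AS (
    SELECT *,
           row_number() OVER (
               PARTITION BY local_part, domain_label
               ORDER BY received_at DESC
           ) AS rn
    FROM keyed
)
-- Keep the most recent row per group.
SELECT email, received_at
FROM ranked
WHERE rn = 1;
```

On this sample the query returns five rows rather than the four requested: the aespinola@aaa variants and the accounts@abc variants each collapse to their most recent address, but trac@lobsterink.com is kept alongside tracy-lee.danie@lobsterink.com because the two differ before the '@' and so land in different groups. Merging pairs like that would need a fuzzy comparison such as the pg_trgm similarity used in the question, not an exact key.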