Postgres: finding similar rows between two tables

Date: 2018-04-19 21:26:37

Tags: database postgresql similarity

I am trying to find similarities between rows from different tables. Here is the DDL.

CREATE TABLE a (id int, fname text, lname text, email text, phone text);
INSERT INTO a VALUES 
(1, 'john', 'doe', 'john@gmail.com', null), 
(2, 'peter', 'green', 'peter@gmail.com', null);

CREATE TABLE b (id int, fname text, lname text, email text, phone text);
INSERT INTO b VALUES
(null, 'peter', 'glover', 'bob@gmail.com', '777'),
(null, null, 'green', 'peter@gmail.com', '666');

Suppose we have the following similarity configuration, one weight per column:

fname = 0.1
lname = 0.3
email = 0.5
phone = 0.5

So we can say that the similarity between
(2, 'peter', 'green', 'peter@gmail.com', null) and
(null, null, 'green', 'peter@gmail.com', '666') is 0.8 (lname + email)

(2, 'peter', 'green', 'peter@gmail.com', null) and
(null, 'peter', 'glover', 'bob@gmail.com', '777') is 0.1 (fname)
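
The two scores above can be reproduced in SQL. The following is only a sketch against the sample tables: the one-row CTE `w` is a hypothetical name introduced here to hold the weights in one place, and the query relies on the fact that a comparison involving NULL yields NULL, which `coalesce` then counts as "no match".

```sql
-- Weights live in a one-row CTE (w is a hypothetical name) so they are set once.
WITH w(fname, lname, email, phone) AS (
    VALUES (0.1, 0.3, 0.5, 0.5)
)
SELECT b.lname AS b_lname,
       -- a comparison with NULL yields NULL; coalesce turns that into 0
       coalesce((a.fname = b.fname)::int, 0) * w.fname
     + coalesce((a.lname = b.lname)::int, 0) * w.lname
     + coalesce((a.email = b.email)::int, 0) * w.email
     + coalesce((a.phone = b.phone)::int, 0) * w.phone AS similarity
FROM a
CROSS JOIN b
CROSS JOIN w
WHERE a.id = 2
ORDER BY similarity DESC;
```

Against the sample data this should return 0.8 for the 'green' row and 0.1 for the 'glover' row, matching the figures above.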

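So I want to get the rows from table b whose similarity to a row in table a exceeds some threshold (say 0.7). Based on the example, I need to get something like this: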

id, fname, lname, email, phone, similarity
2,  null,'green', 'peter@gmail.com', '666', 0.8

where id is the id of the similar row from table a.

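I have tried NATURAL FULL OUTER JOIN and EXCEPT, but they don't work for my purpose, or I am just doing it wrong. What kind of index would suit such a query? Table a may have a billion rows.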
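
On the index question, one observation follows from the weights themselves: fname and lname together contribute at most 0.1 + 0.3 = 0.4, so with this configuration a pair can only reach the 0.7 threshold if email or phone matches. Under that assumption, plain B-tree indexes on those two columns of a may be enough to find candidate rows without comparing every pair. A rough sketch (the index names are hypothetical, and nothing here has been benchmarked at the billion-row scale):

```sql
-- Hypothetical index names; B-tree is PostgreSQL's default index type.
CREATE INDEX a_email_idx ON a (email);
CREATE INDEX a_phone_idx ON a (phone);

-- Candidate pairs: only rows of a that share an email or a phone with a row of b.
-- UNION (rather than UNION ALL) drops pairs that match on both columns.
SELECT a.id AS a_id, b.fname, b.lname, b.email, b.phone
FROM b
JOIN a ON a.email = b.email
UNION
SELECT a.id AS a_id, b.fname, b.lname, b.email, b.phone
FROM b
JOIN a ON a.phone = b.phone;
```

The weighted score then only needs to be computed for these candidate pairs. If the weights or the threshold change so that fname and lname alone can exceed the threshold, this shortcut no longer holds.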

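Update: the goal is to match rows. So perhaps it would be better to store all the information in one table and use window functions? The logic would be the same and would depend on the similarity configuration.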

id | fname | lname  |      email      | phone 
---+-------+--------+-----------------+-------
 1 | john  | doe    | john@gmail.com  | 
 2 | peter | green  | peter@gmail.com |
   | peter | glover | bob@gmail.com   | 777
   |       | green  | peter@gmail.com | 666 

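After some processing, each row with a null id should be filled with the id of the row that has the highest similarity, provided it is greater than 0.7; otherwise a new id should be generated.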

1 answer:

Answer 0: (score: 0)

-- get the similarity between the a and b tables
with b_keyed as (
-- b.id is null in the sample data, so number the rows of b to tell them apart
select row_number() over () as tmp_id, fname, lname, email, phone
from b
),
with_similarity as (
select 
a.id, b.tmp_id, b.fname, b.lname, b.email, b.phone,
( coalesce((a.fname = b.fname)::int, 0) * 0.1 +
        coalesce((a.lname = b.lname)::int, 0) * 0.3 +
        coalesce((a.email = b.email)::int, 0) * 0.5 +
        coalesce((a.phone = b.phone)::int, 0) * 0.5
) as similarity
from b_keyed b
cross join a
), 
-- now that every pair has a weight, rank the candidates for each row of b
matched as (
select *,
ROW_NUMBER() OVER(PARTITION BY tmp_id ORDER BY similarity DESC) AS rk
from with_similarity
)

-- pick the best match per row of b; rows without a good enough match keep a null id
select id, fname, lname, email, phone, similarity from matched where rk = 1 and similarity >= 0.7
union all
select null::int, fname, lname, email, phone, similarity from matched where rk = 1 and similarity < 0.7;
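
The query above leaves id null for rows of b that have no sufficiently similar row in a, whereas the update in the question asks for a new id in that case. A possible sketch of that step follows. It assumes that new ids may simply continue after max(a.id) (a sequence would be the safer choice under concurrent writes), that table a is not empty, and that row_number() is an acceptable temporary key for the rows of b; the names b_keyed, best and a_id are introduced here for illustration only.

```sql
WITH b_keyed AS (
    -- b.id is NULL in the sample data, so give each row a temporary key
    SELECT row_number() OVER () AS tmp_id, fname, lname, email, phone
    FROM b
),
best AS (
    -- DISTINCT ON keeps the first row per tmp_id, i.e. the most similar row of a
    SELECT DISTINCT ON (bk.tmp_id)
           bk.tmp_id, a.id AS a_id, bk.fname, bk.lname, bk.email, bk.phone,
           coalesce((a.fname = bk.fname)::int, 0) * 0.1
         + coalesce((a.lname = bk.lname)::int, 0) * 0.3
         + coalesce((a.email = bk.email)::int, 0) * 0.5
         + coalesce((a.phone = bk.phone)::int, 0) * 0.5 AS similarity
    FROM b_keyed bk
    CROSS JOIN a
    ORDER BY bk.tmp_id, similarity DESC
)
SELECT CASE
           WHEN similarity >= 0.7 THEN a_id
           -- number the unmatched rows 1, 2, ... and continue after max(a.id)
           ELSE (SELECT max(id) FROM a)
                + row_number() OVER (PARTITION BY similarity >= 0.7 ORDER BY tmp_id)
       END AS id,
       fname, lname, email, phone, similarity
FROM best;
```

With the sample data, the 'green' row should receive id 2 (similarity 0.8) and the 'glover' row a new id of 3, since its best score is only 0.1.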