I am trying to compute the similarity between rows from different tables. Here is the DDL.
CREATE TABLE a (id int, fname text, lname text, email text, phone text);
INSERT INTO a VALUES
(1, 'john', 'doe', 'john@gmail.com', null),
(2, 'peter', 'green', 'peter@gmail.com', null);
CREATE TABLE b (id int, fname text, lname text, email text, phone text);
INSERT INTO b VALUES
(null, 'peter', 'glover', 'bob@gmail.com', '777'),
(null, null, 'green', 'peter@gmail.com', '666');
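Suppose we have the following similarity configuration, where each matching column adds its weight to the score: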
fname = 0.1
lname = 0.3
email = 0.5
phone = 0.5
So we can say that the similarity between
(2, 'peter', 'green', 'peter@gmail.com', null) and
(null, null, 'green', 'peter@gmail.com', '666') is 0.8 (lname + email),
and the similarity between
(2, 'peter', 'green', 'peter@gmail.com', null) and
(null, 'peter', 'glover', 'bob@gmail.com', '777') is 0.1 (fname).
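As a quick sanity check of the first example, the weighted sum can be evaluated on literal values. This is only a sketch: the `::text` casts and the `coalesce(..., 0)` wrapper are assumptions about how a comparison against null should be scored (as "no match"), since in PostgreSQL `x = NULL` yields null rather than false.

```sql
-- Row (2, 'peter', 'green', 'peter@gmail.com', null) from a
-- against (null, null, 'green', 'peter@gmail.com', '666') from b.
SELECT coalesce(('peter'::text = NULL::text)::int, 0) * 0.1      -- fname: null in b, scores 0
     + coalesce(('green'::text = 'green'::text)::int, 0) * 0.3   -- lname: match
     + coalesce(('peter@gmail.com'::text = 'peter@gmail.com'::text)::int, 0) * 0.5  -- email: match
     + coalesce((NULL::text = '666'::text)::int, 0) * 0.5        -- phone: null in a, scores 0
       AS similarity;
```

The query returns 0.8, which agrees with the lname + email total stated above.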
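I therefore want to get the rows from table b whose similarity to some row in table a exceeds a threshold (say 0.7). For the example above, I need to get something like this:
id, fname, lname, email, phone, similarity
2, null, 'green', 'peter@gmail.com', '666', 0.8
where id is the id of the similar row from table a.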
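I have already tried NATURAL FULL OUTER JOIN and EXCEPT, but they do not work for my purpose, or I am simply using them wrong. Also, what kind of index would suit such a query? Table a may have a billion rows.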
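Update: the goal is to match rows. So might it be better to store everything in one table and use window functions? The logic would be the same and would still depend on the similarity configuration.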
 id | fname | lname  |      email      | phone
----+-------+--------+-----------------+-------
  1 | john  | doe    | john@gmail.com  |
  2 | peter | green  | peter@gmail.com |
    | peter | glover | bob@gmail.com   | 777
    |       | green  | peter@gmail.com | 666
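After processing, each row with a null id should be assigned the id of the row with the highest similarity, provided that similarity is greater than 0.7; otherwise a new id should be generated.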
Answer 0 (score: 0)
-- b.id is null for every row in the example, and PARTITION BY puts all nulls
-- into one group, so give each b row its own surrogate key first
with b_rows as (
    select row_number() over () as tmp_id, fname, lname, email, phone
    from b
),
-- get the similarity between every row of b and every row of a
with_similarity as (
    select
        a.id, b.tmp_id, b.fname, b.lname, b.email, b.phone,
        ( coalesce((a.fname = b.fname)::int, 0) * 0.1 +
          coalesce((a.lname = b.lname)::int, 0) * 0.3 +
          coalesce((a.email = b.email)::int, 0) * 0.5 +
          coalesce((a.phone = b.phone)::int, 0) * 0.5
        ) as similarity
    from b_rows b
    cross join a
),
-- now that every pair has a weight, rank the candidates for each b row
matched as (
    select *,
        ROW_NUMBER() OVER (PARTITION BY tmp_id ORDER BY similarity DESC) AS rk
    from with_similarity
)
-- pick the best match per b row: matched rows take the id from a,
-- rows below the threshold keep a null id
select id, fname, lname, email, phone from matched where rk = 1 and similarity >= 0.7
union all
select null::int as id, fname, lname, email, phone from matched where rk = 1 and similarity < 0.7;
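
The cross join compares every row of b with every row of a, which is unlikely to be practical if a has a billion rows. One possible variant, sketched under the weights from the question: fname and lname together contribute at most 0.4, so any pair scoring 0.7 or more must agree on email or phone. The join can therefore be restricted to those two columns, where plain B-tree indexes may help. The index names and the surrogate `tmp_id` key are illustrative assumptions, and whether the planner actually uses both indexes for the OR condition should be checked with EXPLAIN on real data.

```sql
-- Hypothetical index names; plain B-tree indexes on the two high-weight columns.
CREATE INDEX a_email_idx ON a (email);
CREATE INDEX a_phone_idx ON a (phone);

WITH b_rows AS (
    -- surrogate key per b row, because b.id is null in the example data
    SELECT row_number() OVER () AS tmp_id, fname, lname, email, phone
    FROM b
),
scored AS (
    SELECT b.tmp_id, a.id, b.fname, b.lname, b.email, b.phone,
           coalesce((a.fname = b.fname)::int, 0) * 0.1 +
           coalesce((a.lname = b.lname)::int, 0) * 0.3 +
           coalesce((a.email = b.email)::int, 0) * 0.5 +
           coalesce((a.phone = b.phone)::int, 0) * 0.5 AS similarity
    FROM b_rows b
    -- LEFT JOIN keeps b rows that have no candidate in a; they score 0
    LEFT JOIN a ON a.email = b.email OR a.phone = b.phone
)
-- keep the highest-scoring candidate per b row; below the threshold the id stays null
SELECT DISTINCT ON (tmp_id)
       tmp_id,
       CASE WHEN similarity >= 0.7 THEN id END AS id,
       fname, lname, email, phone, similarity
FROM scored
ORDER BY tmp_id, similarity DESC;
```

With the sample data, the 'green' row from b is paired with id 2 at similarity 0.8, while the 'glover' row has no email or phone candidate in a and keeps a null id. If the weights or the threshold change, the pruning argument has to be rechecked, since it relies on fname + lname alone staying below the threshold.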