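Testing PostgreSQL 9.6.2 on my Mac and playing with n-grams. Assume there is a GIN trigram index on the winery column.
The limit for similarity (I know this is deprecated):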
SELECT set_limit(0.5);
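Since set_limit() is deprecated, the same threshold can be set through a configuration parameter instead. A minimal sketch, assuming PostgreSQL 9.6 or later and permission to create the pg_trgm extension in the current database:

```sql
-- Assumption: the pg_trgm extension can be created in this database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Session-level threshold used by the % operator (replaces set_limit).
SET pg_trgm.similarity_threshold = 0.5;

-- Confirm the value currently in effect.
SHOW pg_trgm.similarity_threshold;
```

The setting only lasts for the current session unless it is also configured at the database or role level.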
I have a trigram search built on a table with 2.3M rows.
My select code:
SELECT winery, similarity(winery, 'chateau chevla blanc') AS similarity
FROM usr_wines
WHERE status=1 AND winery % 'chateau chevla blanc'
ORDER BY similarity DESC;
My results (329 ms on my Mac):
Chateau ChevL Blanc 0,85
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc 0,736842
Chateau Blanc, 0,736842
Chateau Blanc 0,736842
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc 0,727273
Chateau Cheval Blanc (7) 0,666667
Chateau Cheval Blanc Cbo 0,64
Chateau Du Cheval Blanc 0,64
Chateau Du Cheval Blanc 0,64
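OK, I don't understand how "Chateau Blanc" can have a higher similarity than "Chateau Cheval Blanc" in this case. I understand that two of the words, "chateau" and "blanc", are exactly the same, but the third word, "cheval", is missing entirely.
Also, why is "Chateau ChevL Blanc" first? The letter "a" is missing!
My goal is to match all possible duplicates when I give a winery name, even if it is misspelled. What did I miss?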
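For anyone who wants to reproduce these numbers, here is a small sketch. It assumes the pg_trgm extension can be created; the table usr_wines_demo and the index name are hypothetical stand-ins for the real 2.3M-row table, holding only four of the names from the results above.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical miniature copy of usr_wines, for experimentation only.
CREATE TEMP TABLE usr_wines_demo (winery text, status integer);
INSERT INTO usr_wines_demo (winery, status) VALUES
    ('Chateau ChevL Blanc', 1),
    ('Chateau Blanc', 1),
    ('Chateau Cheval Blanc', 1),
    ('Chateau Du Cheval Blanc', 1);

-- GIN trigram index on the winery column, as assumed in the question.
CREATE INDEX usr_wines_demo_winery_trgm_idx
    ON usr_wines_demo USING gin (winery gin_trgm_ops);

-- Same query shape as above, run against the demo table.
SELECT winery, similarity(winery, 'chateau chevla blanc') AS similarity
FROM usr_wines_demo
WHERE status = 1 AND winery % 'chateau chevla blanc'
ORDER BY similarity DESC;
```

With the default threshold of 0.3 all four rows pass the % filter, and they should come back in the same relative order as in the full result set.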
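Answer 0: (score: 17)
The concept of trigram similarity relies on splitting a sentence into "trigrams" (sequences of three consecutive characters) and treating the result as a SET (that is, order doesn't matter and there are no repeated values). Before the trigrams are extracted, the text is lowercased, non-alphanumeric characters are treated as word separators, and each word is padded with two spaces at the beginning and one at the end.
Trigrams are a special case of n-grams.
The trigram set for "Chateau blanc" is found by sliding a three-character window over each padded word: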
  chateau 
---          => '  c'
 ---         => ' ch'
  ---        => 'cha'
   ---       => 'hat'
    ---      => 'ate'
     ---     => 'tea'
      ---    => 'eau'
       ---   => 'au '

  blanc 
---          => '  b'
 ---         => ' bl'
  ---        => 'bla'
   ---       => 'lan'
    ---      => 'anc'
     ---     => 'nc '
(No trigram spans the gap between the two words, because each word is padded separately.)
Sorting them and removing duplicates gives:
'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'
PostgreSQL can compute this with the function show_trgm:
SELECT show_trgm('Chateau blanc') AS A
A = [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
... which has 14 trigrams. (Check the pg_trgm documentation.)
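Because of the lowercasing and the handling of non-alphanumeric characters, letter case and punctuation do not change the trigram set. A quick sketch to confirm this, assuming the pg_trgm extension can be created (the second spelling is just an illustrative variant):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Compare the trigram sets of two spellings that differ only in
-- letter case and punctuation.
SELECT
    show_trgm('Chateau blanc') = show_trgm('CHATEAU, BLANC!') AS same_trigram_set,
    similarity('Chateau blanc', 'CHATEAU, BLANC!')            AS sim;
```

This is why an entry such as "Chateau Blanc," gets exactly the same score as "Chateau Blanc" in the question's results.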
The trigram set for the search string "chateau chevla blanc" is:
SELECT show_trgm('chateau chevla blanc') AS B
B = [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
... which has 19 trigrams.
If you count how many trigrams the two sets have in common, you find the following:
A intersect B =
[  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]
and the ones they have in total are:
A union B =
[  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]
That is, the two sentences have 14 trigrams in common and 19 in total. The similarity is computed as:
similarity = 14 / 19
You can check it with:
SELECT
cast(14.0/19.0 as real) AS computed_result,
similarity('Chateau blanc', 'chateau chevla blanc') AS function_in_pg
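and you will get: 0.736842
... which explains how the similarity is computed and why you get the values you get.
NOTE: you can compute the intersection and the union with: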
SELECT
array_agg(t) AS in_common
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
INTERSECT
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
ORDER BY t
) AS trigrams_in_common ;
SELECT
array_agg(t) AS in_total
FROM
(
SELECT unnest(show_trgm('Chateau blanc')) AS t
UNION
SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;
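The same counting also shows why "Chateau ChevL Blanc" ranks first. A sketch that applies it to three of the candidates from the question, assuming the pg_trgm extension can be created (the VALUES list and the column aliases are only illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT
    candidate,
    -- trigrams shared with the search string
    (SELECT count(*) FROM (
        SELECT unnest(show_trgm(candidate))
        INTERSECT
        SELECT unnest(show_trgm('chateau chevla blanc'))
    ) AS i) AS in_common,
    -- distinct trigrams across both strings
    (SELECT count(*) FROM (
        SELECT unnest(show_trgm(candidate))
        UNION
        SELECT unnest(show_trgm('chateau chevla blanc'))
    ) AS u) AS in_total,
    similarity(candidate, 'chateau chevla blanc') AS sim
FROM (VALUES
    ('Chateau ChevL Blanc'),
    ('Chateau Blanc'),
    ('Chateau Cheval Blanc')
) AS c(candidate)
ORDER BY sim DESC;
```

"Chateau ChevL Blanc" shares 17 of 20 trigrams with the search string (0.85), because 'chevl' and 'chevla' differ only at the end of the word. "Chateau Blanc" shares 14 of 19 (0.736842). "Chateau Cheval Blanc" shares 16 of 22 (0.727273): swapping 'al' for 'la' replaces three trigrams on each side, which shrinks the intersection and enlarges the union at the same time.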
And this is a way to explore the similarity of different pairs of sentences:
WITH p AS
(
SELECT
'This is just a sentence I''ve invented'::text AS f1,
'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
SELECT
(SELECT count(*) FROM
(SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
(SELECT count(*) FROM
(SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
similarity(f1, f2) AS sim_2
FROM
p
)
SELECT
same, total, same::real/total::real AS sim_1, sim_2
FROM
x ;
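You can check it at Rextester.
Answer 1: (score: 4)
The trigram algorithm tends to be more accurate the smaller the difference in length between the compared strings. You can modify the algorithm to compensate for the effect of the length difference.
The following example function reduces the similarity by 1% for each character of difference in string length. This means it favors strings of the same (or similar) length.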
create or replace function corrected_similarity(str1 text, str2 text)
returns float4 language sql as $$
select similarity(str1, str2)* (1- abs(length(str1)-length(str2))/100.0)::float4
$$;
select
winery,
similarity(winery, 'chateau chevla blanc') as similarity,
corrected_similarity(winery, 'chateau chevla blanc') as corrected_similarity
from usr_wines
where winery % 'chateau chevla blanc'
order by corrected_similarity desc;
winery | similarity | corrected_similarity
--------------------------+------------+----------------------
Chateau ChevL Blanc | 0.85 | 0.8415
Chateau Cheval Blanc | 0.727273 | 0.727273
Chateau Cheval Blanc | 0.727273 | 0.727273
Chateau Cheval Blanc | 0.727273 | 0.727273
Chateau Blanc, | 0.736842 | 0.692632
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Blanc | 0.736842 | 0.685263
Chateau Cheval Blanc (7) | 0.666667 | 0.64
Chateau Du Cheval Blanc | 0.64 | 0.6208
Chateau Du Cheval Blanc | 0.64 | 0.6208
Chateau Cheval Blanc Cbo | 0.64 | 0.6144
(14 rows)
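In a similar way you could correct the standard similarity by, for example, how many initial characters are identical (though the function would be a bit more complicated).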
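One possible shape for such a prefix-based correction is sketched below. Both function names are hypothetical, and the weighting (a 1% boost per shared leading character, capped at 1) is an arbitrary assumption chosen only to mirror the 1% used above; it is not something pg_trgm provides.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical helper: length of the shared, case-insensitive prefix.
-- If the prefixes of length i match, all shorter prefixes match too,
-- so counting the matching prefix lengths gives the shared length.
CREATE OR REPLACE FUNCTION common_prefix_length(str1 text, str2 text)
RETURNS integer LANGUAGE sql AS $$
    SELECT count(*)::integer
    FROM generate_series(1, least(length(str1), length(str2))) AS i
    WHERE lower(left(str1, i)) = lower(left(str2, i))
$$;

-- Hypothetical scoring function: boosts the standard similarity by 1%
-- per shared leading character, never exceeding 1.
CREATE OR REPLACE FUNCTION prefix_corrected_similarity(str1 text, str2 text)
RETURNS float4 LANGUAGE sql AS $$
    SELECT least(1.0,
                 similarity(str1, str2)
                 * (1 + common_prefix_length(str1, str2) / 100.0))::float4
$$;

SELECT
    common_prefix_length('Chateau Cheval Blanc', 'chateau chevla blanc')        AS shared_prefix,
    prefix_corrected_similarity('Chateau Cheval Blanc', 'chateau chevla blanc') AS boosted;
```

The boost could also be combined with the length penalty above; which weighting works best depends on the data and would need tuning.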
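Answer 2: (score: 0)
Sometimes you want the opposite of klin's answer (the length-corrected similarity above). In some applications, a large difference in string length should not cause such a dramatic loss of score.
For example, imagine an autocomplete form whose trigram-matched suggestions improve as you type.
Here is an alternative way of scoring matches that still uses trigrams but favors substring matches.
The similarity formula instead starts with
the number of common trigrams
-------------------------------------------
the number of trigrams in the shortest word <-- key difference
and can go up from there based on a good standard similarity score.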
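To see what this formula changes, it can be worked out by hand for a short search term against a longer name. A sketch, assuming the pg_trgm extension can be created (the CTE and column names are only illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

WITH s AS (
    SELECT show_trgm('chateau')              AS a,
           show_trgm('Chateau Cheval Blanc') AS b
),
c AS (
    SELECT cardinality(a) AS a_len,
           cardinality(b) AS b_len,
           (SELECT count(*)
            FROM (SELECT unnest(a) INTERSECT SELECT unnest(b)) AS i) AS common
    FROM s
)
SELECT a_len, b_len, common,
       -- common trigrams divided by the size of the smaller set
       common::float4 / least(a_len, b_len)              AS substring_score,
       -- standard pg_trgm score: common divided by the union
       similarity('chateau', 'Chateau Cheval Blanc')     AS standard_score
FROM c;
```

Here 'chateau' has 8 trigrams, all of which appear among the 19 of the longer name, so the substring score is 8/8 = 1 while the standard similarity is only 8/19 (about 0.42).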
CREATE OR REPLACE FUNCTION substring_similarity(string_a TEXT, string_b TEXT) RETURNS FLOAT4 AS $$
DECLARE
a_trigrams TEXT[];
b_trigrams TEXT[];
a_tri_len INTEGER;
b_tri_len INTEGER;
common_trigrams TEXT[];
max_common INTEGER;
BEGIN
a_trigrams = SHOW_TRGM(string_a);
b_trigrams = SHOW_TRGM(string_b);
a_tri_len = ARRAY_LENGTH(a_trigrams, 1);
b_tri_len = ARRAY_LENGTH(b_trigrams, 1);
-- ARRAY_LENGTH returns NULL for an empty array, so compare via COALESCE
IF (COALESCE(a_tri_len, 0) = 0 OR COALESCE(b_tri_len, 0) = 0) THEN
IF (string_a = string_b) THEN
RETURN 1;
ELSE
RETURN 0;
END IF;
END IF;
common_trigrams := ARRAY(SELECT UNNEST(a_trigrams) INTERSECT SELECT UNNEST(b_trigrams));
max_common = LEAST(a_tri_len, b_tri_len);
RETURN COALESCE(ARRAY_LENGTH(common_trigrams, 1), 0)::FLOAT4 / max_common::FLOAT4;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION corrected_similarity(string_a TEXT, string_b TEXT)
RETURNS FLOAT4 AS $$
DECLARE
base_score FLOAT4;
BEGIN
base_score := substring_similarity(string_a, string_b);
-- a good standard similarity score can raise the base_score
RETURN base_score + ((1.0 - base_score) * SIMILARITY(string_a, string_b));
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION is_minimally_substring_similar(string_a TEXT, string_b TEXT) RETURNS BOOLEAN AS $$
BEGIN
RETURN corrected_similarity(string_a, string_b) >= 0.5;
END;
$$ LANGUAGE plpgsql;
CREATE OPERATOR %%% (
leftarg = TEXT,
rightarg = TEXT,
procedure = is_minimally_substring_similar,
commutator = %%%
);
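Now you can use it in the same way as a standard similarity query:
SELECT * FROM your_table WHERE name %%% 'chateau'
ORDER BY corrected_similarity(name, 'chateau') DESC;
Performance is acceptable for a search space of 100k records, but probably not great for millions. For that you may want to use a modified build of the pg_trgm module (code on GitHub).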