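PostgreSQL, trigrams and similarity

Time: 2017-04-01 12:40:16

Tags: postgresql similarity trigram

I'm testing PostgreSQL 9.6.2 on my Mac and experimenting with n-grams. Assume there is a GIN trigram index on the winery column.

The similarity limit (I know this is deprecated):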

SELECT set_limit(0.5);

I built a trigram search on a table with 2.3M rows.

My select code:

SELECT winery, similarity(winery, 'chateau chevla blanc') AS similarity 
FROM usr_wines 
WHERE status=1 AND winery % 'chateau chevla blanc'  
ORDER BY similarity DESC;

My results (329 ms on my Mac):

Chateau ChevL Blanc 0,85
Chateau Blanc   0,736842
Chateau Blanc   0,736842
Chateau Blanc   0,736842
Chateau Blanc   0,736842
Chateau Blanc,  0,736842
Chateau Blanc   0,736842
Chateau Cheval Blanc    0,727273
Chateau Cheval Blanc    0,727273
Chateau Cheval Blanc    0,727273
Chateau Cheval Blanc (7)    0,666667
Chateau Cheval Blanc Cbo    0,64
Chateau Du Cheval Blanc 0,64
Chateau Du Cheval Blanc 0,64

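Well, I don't understand how "Chateau Blanc" can have a higher similarity than "Chateau Cheval Blanc" in this case. As far as I can tell, the two names share exactly the same words "chateau" and "blanc", but the first one lacks the other word, "cheval".

And why is "Chateau ChevL Blanc" first? A letter "a" is missing!

My goal is to match all possible duplicates when I'm given a winery name, even if it is misspelled. What did I miss?

3 answers:

Answer 0 (score: 17)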

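The concept of trigram similarity relies on splitting any sentence into "trigrams" (sequences of three consecutive characters) and treating the result as a SET (i.e., order doesn't matter and there are no repeated values). Before the trigrams are extracted, the sentence is lower-cased and split into words, and each word is treated as if it had two spaces prepended and one space appended. As a consequence, no trigram spans two words.

Trigrams are a special case of N-grams.

The trigrams for "Chateau blanc" are found by taking every sequence of three characters that appears in each padded word:

  chateau 
---         => '  c'
 ---        => ' ch'
  ---       => 'cha'
   ---      => 'hat'
    ---     => 'ate'
     ---    => 'tea'
      ---   => 'eau'
       ---  => 'au '

  blanc 
---       => '  b'
 ---      => ' bl'
  ---     => 'bla'
   ---    => 'lan'
    ---   => 'anc'
     ---  => 'nc '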

Sorting them and removing any duplicates gives:

'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'

PostgreSQL can compute this with the function show_trgm:

SELECT show_trgm('Chateau blanc') AS A

A = [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

...which has 14 trigrams. (See the pg_trgm documentation.)
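The extraction described above can be reproduced without a database. The following is a minimal Python sketch, not pg_trgm's actual implementation: the function name `trigrams` is made up for illustration, and it assumes plain ASCII input where every run of characters other than letters and digits separates words.

```python
import re

def trigrams(text):
    """Approximate show_trgm(): the set of 3-character windows of each padded word."""
    result = set()
    # Lower-case first, then treat every run of non-alphanumerics as a separator.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result

print(sorted(trigrams("Chateau blanc")))
print(len(trigrams("Chateau blanc")))
```

Because punctuation only separates words in this sketch, "Chateau Blanc," and "Chateau Blanc" yield the same set, which is consistent with both rows getting the same score in the question.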

The trigrams for "chateau chevla blanc" (the search string) are:

SELECT show_trgm('chateau chevla blanc') AS B 

B = [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

...which has 19 trigrams.

If you count how many trigrams the two sets have in common, you find the following:

A intersect B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

And in total they have:

A union B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

That is, the two sentences have 14 trigrams in common and 19 in total. The similarity is computed as:

 similarity = 14 / 19

You can check it with:

SELECT 
    cast(14.0/19.0 as real) AS computed_result, 
    similarity('Chateau blanc', 'chateau chevla blanc') AS function_in_pg

You will see 0.736842 in both columns.

...which explains how the similarity is computed and why you get the values you get.
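The same arithmetic can be applied to the other rows of the question. The sketch below is an approximation in Python, not the pg_trgm source: `trigrams` and `similarity` are illustrative names, and ASCII input is assumed.

```python
import re

def trigrams(text):
    """Approximate show_trgm(): the set of 3-character windows of each padded word."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result

def similarity(a, b):
    """Shared trigrams divided by the trigrams in the union of both sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

query = "chateau chevla blanc"
for name in ("Chateau ChevL Blanc", "Chateau Blanc", "Chateau Cheval Blanc"):
    common = len(trigrams(name) & trigrams(query))
    total = len(trigrams(name) | trigrams(query))
    print(f"{name:22} {common}/{total} = {similarity(name, query):.6f}")
```

This gives 17/20 for "Chateau ChevL Blanc", 14/19 for "Chateau Blanc" and 16/22 for "Chateau Cheval Blanc". "Chateau Blanc" adds no trigrams of its own to the union, whereas "cheval" contributes three trigrams the query lacks ('eva', 'val', 'al ') and the misspelled "chevla" contributes three that "cheval" lacks ('evl', 'vla', 'la '), so the denominator grows faster than the numerator.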

Note: you can compute the intersection and the union with:

SELECT 
   array_agg(t) AS in_common
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    INTERSECT 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
    ORDER BY t
) AS trigrams_in_common ;

SELECT 
   array_agg(t) AS in_total
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    UNION 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;

And this is a way to explore the similarity of different pairs of sentences:

WITH p AS
(
    SELECT 
      'This is just a sentence I''ve invented'::text AS f1,
      'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
    SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
    SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
    SELECT
        (SELECT count(*) FROM 
            (SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
        (SELECT count(*) FROM 
            (SELECT * FROM t1 UNION     SELECT * FROM t2) AS s0)::integer AS total,
        similarity(f1, f2) AS sim_2
FROM
    p 
)
SELECT
    same, total, same::real/total::real AS sim_1, sim_2
FROM
    x ;

You can check it at Rextester.

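Answer 1 (score: 4)

The trigram algorithm should be more accurate the smaller the difference in length between the compared strings. You can modify the algorithm to compensate for the effect of the length difference.

The following example function reduces the similarity by 1% for every character of difference in string length. This means that it favors strings of the same (or similar) length.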

create or replace function corrected_similarity(str1 text, str2 text)
returns float4 language sql as $$
    select similarity(str1, str2)* (1- abs(length(str1)-length(str2))/100.0)::float4
$$;

select 
    winery, 
    similarity(winery, 'chateau chevla blanc') as similarity,
    corrected_similarity(winery, 'chateau chevla blanc') as corrected_similarity
from usr_wines 
where winery % 'chateau chevla blanc'  
order by corrected_similarity desc;

          winery          | similarity | corrected_similarity 
--------------------------+------------+----------------------
 Chateau ChevL Blanc      |       0.85 |               0.8415
 Chateau Cheval Blanc     |   0.727273 |             0.727273
 Chateau Cheval Blanc     |   0.727273 |             0.727273
 Chateau Cheval Blanc     |   0.727273 |             0.727273
 Chateau Blanc,           |   0.736842 |             0.692632
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Cheval Blanc (7) |   0.666667 |                 0.64
 Chateau Du Cheval Blanc  |       0.64 |               0.6208
 Chateau Du Cheval Blanc  |       0.64 |               0.6208
 Chateau Cheval Blanc Cbo |       0.64 |               0.6144
(14 rows)
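The corrected_similarity column can be checked by hand. Here is a rough Python sketch of the same length penalty; the helper names are illustrative, the trigram extraction is an ASCII-only approximation of pg_trgm, and `len()` stands in for SQL's length().

```python
import re

def trigrams(text):
    """Approximate show_trgm(): the set of 3-character windows of each padded word."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result

def similarity(a, b):
    """Shared trigrams divided by the trigrams in the union of both sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def corrected_similarity(a, b):
    """Reduce the score by 1% per character of difference in string length."""
    penalty = 1 - abs(len(a) - len(b)) / 100.0
    return similarity(a, b) * penalty

query = "chateau chevla blanc"
for name in ("Chateau ChevL Blanc", "Chateau Cheval Blanc", "Chateau Blanc"):
    print(f"{name:22} {corrected_similarity(name, query):.6f}")
```

"Chateau Blanc" is 7 characters shorter than the query, so its 0.736842 is multiplied by 0.93 and drops below "Chateau Cheval Blanc", which has the same length as the query and keeps its 0.727273.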

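In a similar way you can correct the standard similarity by, for example, how many initial characters are identical (though the function would be a bit more complex).

Answer 2 (score: 0)

Sometimes you want the opposite of Klin's answer. There are applications where a large difference in string length should not cause such a dramatic drop in score.

For example, imagine an autocomplete results form with trigram-matched suggestions that improve as you type.

Here is an alternative way to score matches that still uses trigrams but favors substring matches more than usual.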

The similarity formula instead starts with

the number of common trigrams
-------------------------------------------
the number of trigrams in the shortest word    <-- key difference

and can go up from there based on a good standard similarity score.

CREATE OR REPLACE FUNCTION substring_similarity(string_a TEXT, string_b TEXT) RETURNS FLOAT4 AS $$
DECLARE
  a_trigrams TEXT[];
  b_trigrams TEXT[];
  a_tri_len INTEGER;
  b_tri_len INTEGER;
  common_trigrams TEXT[];
  max_common INTEGER;
BEGIN
  a_trigrams = SHOW_TRGM(string_a);
  b_trigrams = SHOW_TRGM(string_b);
  a_tri_len = ARRAY_LENGTH(a_trigrams, 1);
  b_tri_len = ARRAY_LENGTH(b_trigrams, 1);
  IF (NOT (a_tri_len > 0) OR NOT (b_tri_len > 0)) THEN
    IF (string_a = string_b) THEN
      RETURN 1;
    ELSE
      RETURN 0;
    END IF;
  END IF;
  common_trigrams := ARRAY(SELECT UNNEST(a_trigrams) INTERSECT SELECT UNNEST(b_trigrams));
  max_common = LEAST(a_tri_len, b_tri_len);
  RETURN COALESCE(ARRAY_LENGTH(common_trigrams, 1), 0)::FLOAT4 / max_common::FLOAT4;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION corrected_similarity(string_a TEXT, string_b TEXT) 
RETURNS FLOAT4 AS $$
DECLARE
  base_score FLOAT4;
BEGIN
  base_score := substring_similarity(string_a, string_b);
  -- a good standard similarity score can raise the base_score
  RETURN base_score + ((1.0 - base_score) * SIMILARITY(string_a, string_b));
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION is_minimally_substring_similar(string_a TEXT, string_b TEXT) RETURNS BOOLEAN AS $$
BEGIN
  RETURN corrected_similarity(string_a, string_b) >= 0.5;
END;
$$ LANGUAGE plpgsql;

CREATE OPERATOR %%% (
  leftarg = TEXT,
  rightarg = TEXT,
  procedure = is_minimally_substring_similar,
  commutator = %%%
);

Now you can use it the same way as a standard similarity query:

SELECT * FROM table WHERE name %%% 'chateau'
ORDER BY corrected_similarity(name, 'chateau') DESC;
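To see how this scoring behaves on short inputs, here is a small Python sketch that mirrors the two PL/pgSQL functions above. It is an approximation under stated assumptions: the trigram extraction is an ASCII-only stand-in for show_trgm, and the sample strings are made up for illustration.

```python
import re

def trigrams(text):
    """Approximate show_trgm(): the set of 3-character windows of each padded word."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result

def similarity(a, b):
    """Standard score: shared trigrams divided by trigrams in the union."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def substring_similarity(a, b):
    """Shared trigrams divided by the size of the smaller trigram set."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 1.0 if a == b else 0.0
    return len(ta & tb) / min(len(ta), len(tb))

def corrected_similarity(a, b):
    """Start from the substring score and let the standard score raise it."""
    base = substring_similarity(a, b)
    return base + (1.0 - base) * similarity(a, b)

for typed in ("chat", "chateau"):
    name = "Chateau Cheval Blanc"
    print(typed, round(similarity(typed, name), 4),
          round(corrected_similarity(typed, name), 4))
```

With the full word "chateau" typed, all 8 of its trigrams occur in "Chateau Cheval Blanc", so the corrected score is 1.0 while the standard similarity is only 8/19. Such a match passes the 0.5 cutoff of the %%% operator even though it would fail a standard threshold of 0.5.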

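Performance

Performance is acceptable for a search space of 100k records, but probably not great for a search space of millions of records. For that, you may want to use a modified build of the pg_trgm module (code on GitHub).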