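Phrase comparison in PostgreSQL

Time: 2018-12-07 12:37:03

Tags: postgresql

How would I go about finding rows in a Postgres table where a varchar column contains the same 3-word phrase?

In other questions, most full-text search suggestions compare a vector against a specific query, but what I am looking for is rows that share any 3-word phrase with other rows.

Example: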

SELECT * 
FROM types t1 
WHERE EXISTS (SELECT * 
              FROM types t2 
              WHERE t1.name phrase_matches t2.name 
                AND t1.id > t2.id)

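Here, phrase_matches is a made-up operation, where

'my foo bar baz' phrase_matches 'foo bar baz whatever' returns true

'my foo bar baz' phrase_matches 'foo baz whatever bar' returns false

Edit: an update for everyone arriving from Google: the solution without a temporary table (using a join) took over an hour on a table with 18,000 rows. The temporary-table version ran in a few seconds in total.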

3 answers:

Answer 0: (score: 2)

Make a trigrams-to-row-ids table, then self-join on the trigram column. It is wasteful of space, but hands down the easiest way. With help from klin's answer to "How to extract n-gram word sequences from text in Postgres":

-- your table
CREATE TABLE phrases (
  id INT,
  phrase TEXT
);

-- your data
INSERT INTO phrases (id, phrase) VALUES
(1, 'my foo bar baz'),
(2, 'foo bar baz whatever'),
(3, 'foo baz whatever bar');

-- function to extract word n-grams
-- from https://stackoverflow.com/a/51571001/240443
CREATE OR REPLACE FUNCTION word_ngrams(str TEXT, n INT)
RETURNS SETOF TEXT LANGUAGE plpgsql AS $$
DECLARE
    i INT;
    arr TEXT[];
BEGIN
    str := regexp_replace(str, '[^[:alnum:]|\s]', '', 'g');
    arr := string_to_array(str, ' ');
    FOR i in 1 .. cardinality(arr) - n + 1 LOOP
        RETURN NEXT array_to_string(arr[i : i+n-1], ' ');
    END LOOP;
END $$;

-- table of all trigrams (my foo bar, foo bar baz, bar baz whatever...)
-- and rows they belong to
CREATE TEMPORARY TABLE trigrams (
  id INT,
  trigram TEXT
);

-- make sure JOIN doesn't take forever
CREATE INDEX ON trigrams (trigram, id);

-- extract the trigrams into their stylish new - yet temporary - home
INSERT INTO trigrams SELECT id, word_ngrams(phrase, 3) FROM phrases;

-- see which original rows have common trigrams
SELECT DISTINCT T1.id AS id1, T2.id AS id2
FROM trigrams T1 JOIN trigrams T2
  ON T1.trigram = T2.trigram
  AND T1.id < T2.id;

-- | id1 | id2
---+-----+----
-- |   1 |   2

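You can also use the word_ngrams function directly, without a temporary table, but it will be much slower. Time or space, pick only one :P The following replaces everything in the previous snippet from CREATE TEMPORARY TABLE onward (but still uses klin's excellent function).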

SELECT DISTINCT T1.id AS id1, T2.id AS id2
FROM phrases T1 JOIN phrases T2
  ON EXISTS (
    SELECT word_ngrams(T1.phrase, 3)
    INTERSECT
    SELECT word_ngrams(T2.phrase, 3)
  )
  AND T1.id < T2.id;

-- | id1 | id2
---+-----+----
-- |   1 |   2

Answer 1: (score: 1)

demo: db<>fiddle

WITH words AS (
    SELECT phrase, unnest, row_number() OVER ()
    FROM (
        SELECT phrase, unnest(string_to_array(phrase, ' '))
        FROM phrases
    )s
), phrase_parts AS (

    SELECT 
        phrase, array_to_string(array_agg, ' ') as check_phrase
    FROM (
        SELECT
            w1.phrase, array_agg(w2.unnest) OVER (PARTITION BY w1.row_number ORDER BY w2.row_number)
        FROM words w1
        JOIN words w2
        ON w1.phrase = w2.phrase and w1.row_number <= w2.row_number

        ORDER BY w1.row_number, w2.row_number
    ) s
    WHERE array_length(array_agg, 1) = 3
)
SELECT p.phrase as a, pp.phrase as b, pp.check_phrase 
FROM 
    phrases p 
JOIN 
    phrase_parts pp 
ON p.phrase LIKE '%' || pp.check_phrase || '%' and p.phrase <> pp.phrase

Extended data set:

phrase
my foo bar baz
foo baz whatever bar
foo bar baz whatever
blah my foo bar blah
blah my foo baz blah

Result:

a                      b                      check_phrase
blah my foo bar blah   my foo bar baz         my foo bar
foo bar baz whatever   my foo bar baz         foo bar baz
my foo bar baz         foo bar baz whatever   foo bar baz
blah my foo baz blah   blah my foo bar blah   blah my foo
my foo bar baz         blah my foo bar blah   my foo bar
blah my foo bar blah   blah my foo baz blah   blah my foo
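
  1. The CTE words creates a list of all words of all phrases. Every word gets an index to preserve the original order within its phrase.

  2. The CTE phrase_parts creates all possible 3-word phrases: for every original phrase, all of its words are joined with each other.

The result after the join looks like this: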

phrase                 unnest   row_number   phrase                 unnest     row_number
my foo bar baz         my       1            my foo bar baz         my         1
my foo bar baz         my       1            my foo bar baz         foo        2
my foo bar baz         my       1            my foo bar baz         bar        3
my foo bar baz         my       1            my foo bar baz         baz        4
my foo bar baz         foo      2            my foo bar baz         foo        2
my foo bar baz         foo      2            my foo bar baz         bar        3
my foo bar baz         foo      2            my foo bar baz         baz        4
my foo bar baz         bar      3            my foo bar baz         bar        3
my foo bar baz         bar      3            my foo bar baz         baz        4
my foo bar baz         baz      4            my foo bar baz         baz        4
foo baz whatever bar   foo      5            foo baz whatever bar   foo        5
foo baz whatever bar   foo      5            foo baz whatever bar   baz        6
foo baz whatever bar   foo      5            foo baz whatever bar   whatever   7
foo baz whatever bar   foo      5            foo baz whatever bar   bar        8
foo baz whatever bar   baz      6            foo baz whatever bar   baz        6
...

With the window function array_agg(), I can aggregate the second unnest column like this:

array_agg
{my}
{my,foo}
{my,foo,bar}
{my,foo,bar,baz}
{foo}
{foo,bar}
{foo,bar,baz}
{bar}
{bar,baz}
{baz}
{foo}
{foo,baz}
{foo,baz,whatever}
{foo,baz,whatever,bar}
...

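This is then filtered for array length = 3 and converted back into a string. The result is the set of three-word phrases.

  3. The last step checks every phrase in the table to see whether it contains any of the 3-word phrases (and is not the phrase it came from).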
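
The same three steps can also be sketched more compactly. This is a variation, not the query from this answer: it takes word positions from unnest ... WITH ORDINALITY (row_number() OVER () has no ORDER BY, so its numbering is not guaranteed to follow word order) and builds the 3-word phrases with lead() instead of a self-join. The names demo_phrases and demo_matches are hypothetical, and the sketch assumes the phrases are distinct, since it partitions by the phrase text.

```sql
-- Sample data; demo_phrases is a hypothetical stand-in for the phrases table.
CREATE TEMPORARY TABLE demo_phrases (phrase TEXT);
INSERT INTO demo_phrases VALUES
  ('my foo bar baz'),
  ('foo bar baz whatever'),
  ('foo baz whatever bar');

CREATE TEMPORARY VIEW demo_matches AS
WITH words AS (
  -- Step 1: one row per word, with its position inside its own phrase.
  SELECT p.phrase, w.word, w.pos
  FROM demo_phrases p
  CROSS JOIN LATERAL
    unnest(string_to_array(p.phrase, ' ')) WITH ORDINALITY AS w(word, pos)
), phrase_parts AS (
  -- Step 2: glue each word to the two words that follow it.
  -- The result is NULL (and filtered out) when fewer than two words follow.
  SELECT phrase, check_phrase
  FROM (
    SELECT phrase,
           word || ' ' || lead(word, 1) OVER win
                || ' ' || lead(word, 2) OVER win AS check_phrase
    FROM words
    WINDOW win AS (PARTITION BY phrase ORDER BY pos)
  ) s
  WHERE check_phrase IS NOT NULL
)
-- Step 3: find other phrases that contain one of the 3-word phrases.
SELECT p.phrase AS a, pp.phrase AS b, pp.check_phrase
FROM demo_phrases p
JOIN phrase_parts pp
  ON p.phrase LIKE '%' || pp.check_phrase || '%'
 AND p.phrase <> pp.phrase;

SELECT * FROM demo_matches;
```

On this sample data the view returns two rows, both with check_phrase 'foo bar baz'. Keep in mind that LIKE matches substrings, so 'foo bar baz' would also match inside 'foo bar bazaar', and that % or _ inside a phrase act as wildcards.
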

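Answer 2: (score: 0)

There may be better options, but you could also do something like the following. It is not exactly what you asked for, but I'm sure you will be able to build on the idea.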

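-- names that share at least 3 words (in any order) with rows that have a higher id;
-- the count is taken over all such rows together, not per pair
select n.name
from (
  select x.name as xname, count(*)
  from (
    -- one row per word of every name
    select id, name, unnest(string_to_array(name, ' ')) as word
    from new
  ) as x
  inner join (
    select id, name, unnest(string_to_array(name, ' ')) as word
    from new
  ) as y
    on x.word = y.word and y.id > x.id
  group by x.name
  having count(*) >= 3
) r
inner join new n on r.xname = n.name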

Here is a fiddle: https://www.db-fiddle.com/f/phLirNij577PwEpd8UERef/0

Note that I did not include id in the fiddle, but you can add it yourself.
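
To see how this differs from what the question asks for, here is a minimal sketch of the same word-counting idea done per pair of rows. The names demo_names and demo_shared_words are hypothetical, and the sample rows are the three phrases from the question. Because it only counts shared words, order and adjacency are ignored.

```sql
-- Sample data; demo_names is a hypothetical stand-in for the "new" table.
CREATE TEMPORARY TABLE demo_names (id INT, name TEXT);
INSERT INTO demo_names VALUES
  (1, 'my foo bar baz'),
  (2, 'foo bar baz whatever'),
  (3, 'foo baz whatever bar');

CREATE TEMPORARY VIEW demo_shared_words AS
WITH words AS (
  -- one row per distinct word of every name
  SELECT DISTINCT n.id, w.word
  FROM demo_names n
  CROSS JOIN LATERAL unnest(string_to_array(n.name, ' ')) AS w(word)
)
-- count the words each pair of rows has in common
SELECT x.id AS id1, y.id AS id2, count(*) AS shared_words
FROM words x
JOIN words y ON x.word = y.word AND x.id < y.id
GROUP BY x.id, y.id
HAVING count(*) >= 3;

SELECT * FROM demo_shared_words ORDER BY id1, id2;
```

On this data all three pairs are returned, including (1, 3), which the question's phrase_matches would reject because the three shared words are not consecutive. So treat this as a cheap pre-filter at best, and check the surviving pairs with a real phrase comparison.
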