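How would I find rows in a Postgres table that contain the same 3-word phrase in a varchar column?
In other questions, most full-text search suggestions compare a vector against one specific query, but I am looking for rows that share any 3-word phrase with other rows.
Example: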
SELECT *
FROM types t1
WHERE EXISTS (SELECT *
              FROM types t2
              WHERE t1.name phrase_matches t2.name
                AND t1.id > t2.id)
Here, phrase_matches is a made-up operator such that
'my foo bar baz' phrase_matches 'foo bar baz whatever'
returns true
and
'my foo bar baz' phrase_matches 'foo baz whatever bar'
returns false
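Edit: an update for everyone arriving from Google. The solution without a temporary table (using the join) took over an hour on a table with 18,000 rows. The temporary-table version ran in a few seconds in total.
Answer 0 (score: 2)
Make a trigrams-to-row-ids table, then self-join on the trigram column. It wastes a lot of space, but it is hands down the easiest approach. With help from klin's answer to "How to extract n-gram word sequences from text in Postgres":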
-- your table
CREATE TABLE phrases (
    id INT,
    phrase TEXT
);
-- your data
INSERT INTO phrases (id, phrase) VALUES
    (1, 'my foo bar baz'),
    (2, 'foo bar baz whatever'),
    (3, 'foo baz whatever bar');
-- function to extract word n-grams
-- from https://stackoverflow.com/a/51571001/240443
CREATE OR REPLACE FUNCTION word_ngrams(str TEXT, n INT)
RETURNS SETOF TEXT LANGUAGE plpgsql AS $$
DECLARE
    i INT;
    arr TEXT[];
BEGIN
    -- strip every character that is not alphanumeric, '|' or whitespace
    str := regexp_replace(str, '[^[:alnum:]|\s]', '', 'g');
    arr := string_to_array(str, ' ');
    -- emit every run of n consecutive words
    FOR i IN 1 .. cardinality(arr) - n + 1 LOOP
        RETURN NEXT array_to_string(arr[i : i+n-1], ' ');
    END LOOP;
END $$;
-- table of all trigrams (my foo bar, foo bar baz, bar baz whatever...)
-- and rows they belong to
CREATE TEMPORARY TABLE trigrams (
    id INT,
    trigram TEXT
);
-- make sure JOIN doesn't take forever
CREATE INDEX ON trigrams (trigram, id);
-- extract the trigrams into their stylish new - yet temporary - home
INSERT INTO trigrams SELECT id, word_ngrams(phrase, 3) FROM phrases;
-- see which original rows have common trigrams
SELECT DISTINCT T1.id AS id1, T2.id AS id2
FROM trigrams T1 JOIN trigrams T2
  ON T1.trigram = T2.trigram
 AND T1.id < T2.id;
-- | id1 | id2
---+-----+----
-- | 1 | 2
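You can also use the word_ngrams function directly, without the temporary table, but it will be much slower. Time or space, pick one :P This replaces everything from CREATE TEMPORARY TABLE onward in the previous snippet (but still uses klin's excellent function).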
SELECT DISTINCT T1.id AS id1, T2.id AS id2
FROM phrases T1 JOIN phrases T2
  ON EXISTS (
       SELECT word_ngrams(T1.phrase, 3)
       INTERSECT
       SELECT word_ngrams(T2.phrase, 3)
     )
 AND T1.id < T2.id;
-- | id1 | id2
---+-----+----
-- | 1 | 2
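To get closer to the phrase_matches predicate from the question, the EXISTS ... INTERSECT check can be wrapped in a function. The following is only a sketch: the name shares_trigram is hypothetical, it inlines the array slicing instead of calling word_ngrams so that it stands on its own, and it assumes words are separated by single spaces (it does not strip punctuation the way word_ngrams does).

```sql
-- Hypothetical helper: true when a and b share at least one run of
-- three consecutive words. Assumes single-space separators.
CREATE OR REPLACE FUNCTION shares_trigram(a TEXT, b TEXT)
RETURNS BOOLEAN LANGUAGE sql IMMUTABLE AS $$
    SELECT EXISTS (
        -- every 3-word slice of a
        SELECT array_to_string(wa.arr[i : i + 2], ' ')
        FROM (SELECT string_to_array(a, ' ') AS arr) AS wa,
             generate_series(1, cardinality(wa.arr) - 2) AS i
        INTERSECT
        -- every 3-word slice of b
        SELECT array_to_string(wb.arr[i : i + 2], ' ')
        FROM (SELECT string_to_array(b, ' ') AS arr) AS wb,
             generate_series(1, cardinality(wb.arr) - 2) AS i
    );
$$;

SELECT shares_trigram('my foo bar baz', 'foo bar baz whatever');  -- true
SELECT shares_trigram('my foo bar baz', 'foo baz whatever bar');  -- false
```

Because this still recomputes the trigrams for every pair of rows, expect it to be slow on large tables, like the join-only variant above and as the question's edit reports; the temporary table remains the faster route.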
Answer 1 (score: 1)
WITH words AS (
    SELECT phrase, unnest, row_number() OVER ()
    FROM (
        SELECT phrase, unnest(string_to_array(phrase, ' '))
        FROM phrases
    ) s
), phrase_parts AS (
    SELECT phrase, array_to_string(array_agg, ' ') AS check_phrase
    FROM (
        SELECT w1.phrase,
               array_agg(w2.unnest) OVER (PARTITION BY w1.row_number ORDER BY w2.row_number)
        FROM words w1
        JOIN words w2
          ON w1.phrase = w2.phrase AND w1.row_number <= w2.row_number
        ORDER BY w1.row_number, w2.row_number
    ) s
    WHERE array_length(array_agg, 1) = 3
)
SELECT p.phrase AS a, pp.phrase AS b, pp.check_phrase
FROM phrases p
JOIN phrase_parts pp
  ON p.phrase LIKE '%' || pp.check_phrase || '%' AND p.phrase <> pp.phrase;
Extended data set:
phrase
my foo bar baz
foo baz whatever bar
foo bar baz whatever
blah my foo bar blah
blah my foo baz blah
Result:
a                    | b                    | check_phrase
---------------------+----------------------+-------------
blah my foo bar blah | my foo bar baz       | my foo bar
foo bar baz whatever | my foo bar baz       | foo bar baz
my foo bar baz       | foo bar baz whatever | foo bar baz
blah my foo baz blah | blah my foo bar blah | blah my foo
my foo bar baz       | blah my foo bar blah | my foo bar
blah my foo bar blah | blah my foo baz blah | blah my foo
The CTE words creates a list of all words of all phrases. Every word gets an index so that its original order within its phrase is preserved.
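One caveat: row_number() OVER () has no ORDER BY, so the numbering depends on the order in which the rows happen to come out of unnest. That usually matches the word order, but it is not guaranteed. A minimal sketch of a more explicit alternative uses WITH ORDINALITY, which attaches each word's position directly; the inline phrases data and the column names word and pos are assumptions for illustration.

```sql
-- Sample data inlined so the sketch is self-contained.
WITH phrases(phrase) AS (
    VALUES ('my foo bar baz'),
           ('foo bar baz whatever')
)
SELECT p.phrase, w.word, w.pos
FROM phrases p
CROSS JOIN LATERAL unnest(string_to_array(p.phrase, ' '))
     WITH ORDINALITY AS w(word, pos)   -- pos = position of word in its phrase
ORDER BY p.phrase, w.pos;
```

Note that pos restarts at 1 for each phrase, unlike the global row_number above, so a query built on it would compare positions within the same phrase.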
The CTE phrase_parts creates all possible 3-word phrases: within each original phrase, every word is joined to itself and to every word that follows it.
The result of the join looks like this:
phrase               | unnest   | row_number | phrase               | unnest   | row_number
---------------------+----------+------------+----------------------+----------+-----------
my foo bar baz       | my       | 1          | my foo bar baz       | my       | 1
my foo bar baz       | my       | 1          | my foo bar baz       | foo      | 2
my foo bar baz       | my       | 1          | my foo bar baz       | bar      | 3
my foo bar baz       | my       | 1          | my foo bar baz       | baz      | 4
my foo bar baz       | foo      | 2          | my foo bar baz       | foo      | 2
my foo bar baz       | foo      | 2          | my foo bar baz       | bar      | 3
my foo bar baz       | foo      | 2          | my foo bar baz       | baz      | 4
my foo bar baz       | bar      | 3          | my foo bar baz       | bar      | 3
my foo bar baz       | bar      | 3          | my foo bar baz       | baz      | 4
my foo bar baz       | baz      | 4          | my foo bar baz       | baz      | 4
foo baz whatever bar | foo      | 5          | foo baz whatever bar | foo      | 5
foo baz whatever bar | foo      | 5          | foo baz whatever bar | baz      | 6
foo baz whatever bar | foo      | 5          | foo baz whatever bar | whatever | 7
foo baz whatever bar | foo      | 5          | foo baz whatever bar | bar      | 8
foo baz whatever bar | baz      | 6          | foo baz whatever bar | baz      | 6
...
With the window function array_agg(), I can aggregate the second unnest column this way:
array_agg
{my}
{my,foo}
{my,foo,bar}
{my,foo,bar,baz}
{foo}
{foo,bar}
{foo,bar,baz}
{bar}
{bar,baz}
{baz}
{foo}
{foo,baz}
{foo,baz,whatever}
{foo,baz,whatever,bar}
...
These arrays are then filtered for array length = 3 and converted back into strings. The result is the set of three-word phrases.
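The same three-word phrases can also be built without the self-join, by looking ahead one and two words with the lead() window function. This is a sketch of an alternative technique, not the answer's own method; the inline data, the id column and the names words, trigrams, word and pos are assumptions for illustration.

```sql
WITH phrases(id, phrase) AS (
    VALUES (1, 'my foo bar baz'),
           (2, 'foo baz whatever bar')
), words AS (
    -- one row per word, with its position inside the phrase
    SELECT p.id, w.word, w.pos
    FROM phrases p
    CROSS JOIN LATERAL unnest(string_to_array(p.phrase, ' '))
         WITH ORDINALITY AS w(word, pos)
), trigrams AS (
    -- concatenating with a NULL lead() yields NULL, which marks
    -- the last two words of each phrase (no full trigram starts there)
    SELECT id, pos,
           word || ' ' || lead(word, 1) OVER win
                || ' ' || lead(word, 2) OVER win AS check_phrase
    FROM words
    WINDOW win AS (PARTITION BY id ORDER BY pos)
)
SELECT id, check_phrase
FROM trigrams
WHERE check_phrase IS NOT NULL
ORDER BY id, pos;
-- 1 | my foo bar
-- 1 | foo bar baz
-- 2 | foo baz whatever
-- 2 | baz whatever bar
```

Each word is visited once per phrase here, whereas the self-join pairs every word with all later words of the same phrase before filtering.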
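Answer 2 (score: 0)
There may be better options, but you could also do something like the following. It is not exactly what you asked for, but I am sure you will be able to adapt the idea.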
-- requires an id column on the table, because the join compares ids
select n.name
from (
    select x.name as xname, count(*)
    from (
        select id, name, unnest(string_to_array(name, ' ')) as name2
        from new
    ) as x
    inner join (
        select id, name, unnest(string_to_array(name, ' ')) as name1
        from new
    ) as y
    on x.name2 = y.name1 and y.id > x.id
    group by x.name
    having count(*) >= 3
) r
inner join new n on r.xname = n.name;
Here is a fiddle: https://www.db-fiddle.com/f/phLirNij577PwEpd8UERef/0
Note that I did not include id in the fiddle, but you can add it yourself.
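Since the query above groups by x.name only, word matches against all later rows are pooled into one count. A variant that groups by both ids reports each pair separately. This is a rough sketch under assumptions: the table name items, its sample rows and the column names are hypothetical, and like the answer it counts shared words regardless of order, so it does not check that the words are consecutive.

```sql
WITH items(id, name) AS (
    VALUES (1, 'my foo bar baz'),
           (2, 'foo bar baz whatever'),
           (3, 'something else entirely')
), words AS (
    -- DISTINCT so a word repeated inside one name is counted once
    SELECT DISTINCT i.id, w.word
    FROM items i
    CROSS JOIN LATERAL unnest(string_to_array(i.name, ' ')) AS w(word)
)
SELECT x.id AS id1, y.id AS id2, count(*) AS shared_words
FROM words x
JOIN words y ON x.word = y.word AND x.id < y.id
GROUP BY x.id, y.id
HAVING count(*) >= 3;
-- id1 | id2 | shared_words
--   1 |   2 |            3
```

Rows found this way could serve as candidates that are then checked for an actual shared 3-word phrase.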