I have a table T1:
Col
-------
1 THE APPLE
THE APPLE
THE APPLE 123
THE APPLE 12/16
BEST THE APPLE
I want T2:
Result
--------
THE APPLE
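I'm using Redshift and am looking for a way to do fuzzy string matching in SQL. The column can hold at most 100 characters, and I will never have more than 25 rows at any time.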
Answer 0 (score: 2)
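This problem takes a fair amount of sophistication to solve, and its running time grows sharply with the length of the strings and the number of records. However, given that your table T1 is fairly small, you might get by with the PL/pgSQL function below.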
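The important point in the code below is the short-circuiting when checking for matches: as soon as the col of a single record does not match the candidate string, no further checking is needed. For long candidate strings the comparison is therefore effectively between the shortest string and one other string; more rows get checked only once the candidate strings have become short enough to actually be more common.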
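The string comparison is case-sensitive; if you want it to be case-insensitive, change LIKE to ILIKE. As a bonus, you get multiple matching strings if several are present in all rows (all of the same length, obviously). On the downside, the function reports the same string more than once when it occurs at several positions in the shortest string, which becomes likely once it is down to single-character matches (and is possible for some 2-character and longer strings). You can use SELECT DISTINCT * to remove these duplicates.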
CREATE FUNCTION find_longest_string_in_T1() RETURNS SETOF text AS $$
DECLARE
  shortest  varchar;           -- The shortest string in T1(col), so the longest possible match
  candidate varchar;           -- Candidate string to test
  sz_sh     integer;           -- Length of "shortest"
  l         integer := 1;      -- Starting position of "candidate" in "shortest"
  sz        integer;           -- Length of "candidate"
  fail      boolean;           -- True when a row of T1(col) does not contain "candidate"
  found_one boolean := false;  -- Flag if we found at least one match
BEGIN
  -- Find the shortest string and its size, don't worry about multiples, need just 1
  SELECT col, char_length(col) INTO shortest, sz_sh
  FROM T1
  ORDER BY char_length(col) ASC NULLS LAST
  LIMIT 1;

  -- Nothing to search when T1 is empty or holds only NULLs
  IF shortest IS NULL THEN
    RETURN;
  END IF;

  -- Get all the candidates from the shortest string and test them from longest to single char
  candidate := shortest;
  sz := sz_sh;
  LOOP
    -- Check rows in T1 if they contain the candidate string.
    -- Short-circuit as soon as a record does not match the candidate
    <<check_T1>>
    BEGIN
      FOR fail IN SELECT col NOT LIKE '%' || candidate || '%' FROM T1 LOOP
        EXIT check_T1 WHEN fail;
      END LOOP;
      -- Block was not exited, so the candidate is present in all rows: we have a match
      RETURN NEXT candidate;
      found_one := true;
    END;

    -- Produce the next candidate
    IF l+sz > sz_sh THEN  -- "candidate" reaches to the end of "shortest"
      -- Exit if we already have at least one matching candidate
      EXIT WHEN found_one;
      -- Exit with empty result when all candidates have been examined
      -- (otherwise a zero-length candidate would match every row)
      EXIT WHEN sz = 1;
      -- .. otherwise shorten the candidate
      sz := sz - 1;
      l := 1;
    ELSE
      -- Move one position over to get the next candidate
      l := l + 1;
    END IF;
    candidate := substring(shortest from l for sz);
  END LOOP;

  RETURN;
END;
$$ LANGUAGE plpgsql STABLE;  -- STABLE rather than IMMUTABLE: the function reads table T1
Calling SELECT * FROM find_longest_string_in_T1() should do the trick.
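For readers who want to follow the search order without a database at hand, here is a minimal Python sketch of the same idea as the function above: take candidates from the shortest string, longest first, and stop checking rows as soon as one row does not contain the candidate. The function name longest_common_substrings and the list t1 are made up for this illustration; this is not a port of the PL/pgSQL code, and it has not been tried on Redshift.

```python
def longest_common_substrings(rows):
    """Return every longest substring that occurs in all rows (case-sensitive)."""
    if not rows:
        return []
    # Any substring common to all rows must fit inside the shortest row
    shortest = min(rows, key=len)
    # Try candidate lengths from longest down to a single character
    for size in range(len(shortest), 0, -1):
        matches = []
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            # all() stops at the first row lacking the candidate (the short-circuit)
            if all(candidate in row for row in rows):
                matches.append(candidate)
        if matches:
            # May contain repeats, just like the SQL function
            return matches
    return []


# Sample rows copied from the question
t1 = ["1 THE APPLE", "THE APPLE", "THE APPLE 123", "THE APPLE 12/16", "BEST THE APPLE"]
print(longest_common_substrings(t1))  # ['THE APPLE']
```

As with the SQL version, a string that occurs at several positions in the shortest row is reported once per position; wrapping the result in set() plays the role of SELECT DISTINCT.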
Create some test data:
INSERT INTO T1
SELECT 'hello' || md5(random()::text) || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1
SELECT md5(random()::text) || 'match' || 'hello' || md5(random()::text) || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1
SELECT 'match' || md5(random()::text) || 'hello' || md5(random()::text) || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1
SELECT md5(random()::text) || 'hello' || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
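This yields 100 rows of 106 characters each and produces the matches "hello" and "match" (any other match of that length is very unlikely). The function returns the two correct strings in just under half a second (no-frills Ubuntu server, PG 9.3, i5 CPU, 4GB RAM).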
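The row count and row length quoted above can be checked with a small Python sketch that builds rows of the same shape as the four INSERT statements. The helpers md5_hex and make_rows are names invented here; md5_hex only stands in for md5(random()::text), so the hex content differs from what PostgreSQL would produce, but the lengths (32 hex characters per hash) are the same.

```python
import hashlib
import random


def md5_hex():
    """32 hex characters, standing in for md5(random()::text)."""
    return hashlib.md5(str(random.random()).encode()).hexdigest()


def make_rows(per_pattern=25):
    """Build rows shaped like the four INSERT statements (25 rows per pattern)."""
    rows = []
    for _ in range(per_pattern):
        rows.append("hello" + md5_hex() + md5_hex() + "match" + md5_hex())
        rows.append(md5_hex() + "match" + "hello" + md5_hex() + md5_hex())
        rows.append("match" + md5_hex() + "hello" + md5_hex() + md5_hex())
        rows.append(md5_hex() + "hello" + md5_hex() + "match" + md5_hex())
    return rows


rows = make_rows()
print(len(rows))               # 100
print({len(r) for r in rows})  # {106}: 5 + 5 + 3 * 32
print(all("hello" in r and "match" in r for r in rows))  # True
```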
Answer 1 (score: 1)
If you are fine with finding the word that occurs in the most rows (with words delimited by spaces), you could use:
select word, count(distinct rn) as num_rows
from (
  select unnest(string_to_array(col, ' ')) as word,
         row_number() over (order by col) as rn
  from tbl
) x
group by word
order by num_rows desc;
Fiddle: http://sqlfiddle.com/#!15/bc803/9/0
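Note that this finds the word apple in 4 rows, not 5. That is because APPLE123 is a single word, whereas APPLE 123 would be two words, one of which is APPLE and would be counted; but that is not how the row is written.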
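The 4-versus-5 effect can be reproduced with a short Python sketch of the same per-row word count. The function rows_per_word and the sample list are hypothetical; the third row is written as "THE APPLE123" on the assumption that this is how it appears in the fiddle, as the note above describes.

```python
from collections import Counter


def rows_per_word(rows):
    """Count, for each space-delimited word, the number of rows containing it."""
    counts = Counter()
    for row in rows:
        # set() counts a word once per row, like count(distinct rn) in the query
        counts.update(set(row.split(" ")))
    return counts


# Assumed sample data: the third row has no space before 123
sample = ["1 THE APPLE", "THE APPLE", "THE APPLE123", "THE APPLE 12/16", "BEST THE APPLE"]
counts = rows_per_word(sample)
print(counts["THE"], counts["APPLE"], counts["APPLE123"])  # 5 4 1
```

Splitting on whitespace only works when the common part is made of whole words; a substring search like the one in the first answer does not have that limitation.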