Question

I have a table T1:

 Col
-----------------
 1 THE APPLE
 THE APPLE
 THE APPLE 123
 THE APPLE 12/16
 BEST THE APPLE

I would like to get T2:

 Result
--------
 THE APPLE

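I am using Redshift and am looking for a way to do fuzzy string matching in SQL. The column can hold at most 100 characters, and I will never have more than 25 rows at any time.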

Answer 1

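This problem takes a fair amount of complexity to solve, and its running time increases sharply with the length of the strings and the number of records. However, given that your table T1 is fairly small, you may well get by with the PL/pgSQL function below.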

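Algorithm

1. Find the shortest value in T1(col). No match across all records can be longer than this value, so it is the first candidate string.
2. Check whether the candidate occurs in all other rows of T1. If it does, return the current candidate to the result set.
3. Move the candidate one position over within the shortest value and go back to step 2, until the candidate reaches the end of the shortest string.
4. If any matching candidates have been found, return from the function. Otherwise, shorten the candidate by one character, start again from the beginning of the shortest string and go to step 2. If no more candidates can be extracted from the shortest string, return an empty result.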
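The four steps can also be sketched outside the database, which is handy for cross-checking results on small inputs. The following Python version is only an illustration of the steps as described; the function name `longest_common_substrings` is made up for this sketch and is not part of the answer's PL/pgSQL code.

```python
def longest_common_substrings(rows):
    """Return every longest substring present in all rows (duplicates kept)."""
    rows = [r for r in rows if r is not None]
    if not rows:
        return []
    # Step 1: the shortest value bounds the length of any common substring.
    shortest = min(rows, key=len)
    # Step 4: try candidate lengths from longest to a single character.
    for size in range(len(shortest), 0, -1):
        found = []
        # Step 3: slide the candidate window along the shortest value.
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            # Step 2: all() stops at the first row that lacks the candidate.
            if all(candidate in row for row in rows):
                found.append(candidate)
        if found:
            return found
    return []


rows = ["1 THE APPLE", "THE APPLE", "THE APPLE 123", "THE APPLE 12/16", "BEST THE APPLE"]
print(longest_common_substrings(rows))  # ['THE APPLE']
```

Like the SQL function, this sketch keeps duplicates: if the same substring appears at several positions of the shortest value, it is reported once per position.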

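Code

The important point in the code below is the short-circuit when checking for matches: as soon as col in a single record does not contain the candidate string, no further rows need to be checked. For long strings, the comparison is therefore effectively between the shortest string and one other string; more rows are checked only once the candidates become short enough to be genuinely more common.

The string comparison is case sensitive; to make it case insensitive, change LIKE to ILIKE. As an added feature, you get every matching string that is present in all rows (all of the same length, obviously). On the downside, it reports the same string several times once it gets down to single-character matches (and possibly for some strings of 2 characters or more). You can use SELECT DISTINCT * to remove these duplicates.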

CREATE FUNCTION find_longest_string_in_T1() RETURNS SETOF text AS $$
DECLARE
  shortest  varchar;       -- The shortest string in T1(col) so the longest possible match
  candidate varchar;       -- Candidate string to test
  sz_sh     integer;       -- Length of "shortest"
  l         integer := 1;  -- Starting position of "candidate" in "shortest"
  sz        integer;       -- Length of "candidate"
  fail      boolean;       -- True when a row of T1(col) does NOT contain "candidate"
  found_one boolean := false; -- Flag if we found at least one match
BEGIN
  -- Find the shortest string and its size, don't worry about multiples, need just 1
  SELECT col, char_length(col) INTO shortest, sz_sh
  FROM T1
  ORDER BY char_length(col) ASC NULLS LAST
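  LIMIT 1;

  -- Nothing to compare when T1 is empty or holds only NULLs
  IF shortest IS NULL THEN
    RETURN;
  END IF;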

  -- Get all the candidates from the shortest string and test them from longest to single char
  candidate := shortest;
  sz := sz_sh;
  LOOP
    -- Check rows in T1 if they contain the candidate string.
    -- Short-circuit as soon as a record does not match the candidate.
    -- Note: any % or _ inside the candidate acts as a LIKE wildcard.
    <<check_T1>>
    BEGIN
      FOR fail IN SELECT col NOT LIKE '%' || candidate || '%' FROM T1 LOOP
        EXIT check_T1 WHEN fail;
      END LOOP;
      -- Block was not exited, so the candidate is present in all rows: we have a match
      RETURN NEXT candidate;
      found_one := true;
    END;

    -- Produce the next candidate
    IF l+sz > sz_sh THEN -- "candidate" reaches to the end of "shortest"
      -- Exit if we already have at least one matching candidate
      EXIT WHEN found_one;
      -- Exit with empty result when even single characters did not match;
      -- shortening further would test the empty string, which matches every row
      EXIT WHEN sz <= 1;
      -- .. otherwise shorten the candidate
      sz := sz - 1;
      l := 1;
    ELSE
      -- Move one position over to get the next candidate
      l := l + 1;
    END IF;
    candidate := substring(shortest from l for sz);
  END LOOP;

  RETURN;
END;
$$ LANGUAGE plpgsql STABLE; -- reads from T1, so it cannot be IMMUTABLE

Calling SELECT * FROM find_longest_string_in_T1() should do the trick.

Simple test

Create some test data:

INSERT INTO T1 
  SELECT 'hello' || md5(random()::text) || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'match' || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT 'match' || md5(random()::text) || 'hello' || md5(random()::text)  || md5(random()::text) FROM generate_series(1, 25);
INSERT INTO T1 
  SELECT md5(random()::text) || 'hello' || md5(random()::text) || 'match' || md5(random()::text) FROM generate_series(1, 25);

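This produces 100 rows of 106 characters each, with the matches "hello" and "match" (any other match is extremely unlikely). The function returns the correct two strings in just under half a second (no-frills Ubuntu server, PG 9.3, i5 CPU, 4GB RAM).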
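To sanity-check the claims about the test data without a database, the same four row patterns can be generated in Python. This is a rough sketch under assumptions: `md5_hex`, `make_rows` and `common_substrings` are hypothetical helper names, the random values differ from whatever PostgreSQL's random() would produce, and the set-based search removes duplicates, unlike the SQL function.

```python
import hashlib
import random


def md5_hex(rng):
    """Mimic md5(random()::text): 32 hexadecimal characters."""
    return hashlib.md5(str(rng.random()).encode()).hexdigest()


def make_rows(rng, per_pattern=25):
    """Build the four groups of rows used in the INSERT statements."""
    patterns = [
        lambda: "hello" + md5_hex(rng) + md5_hex(rng) + "match" + md5_hex(rng),
        lambda: md5_hex(rng) + "match" + "hello" + md5_hex(rng) + md5_hex(rng),
        lambda: "match" + md5_hex(rng) + "hello" + md5_hex(rng) + md5_hex(rng),
        lambda: md5_hex(rng) + "hello" + md5_hex(rng) + "match" + md5_hex(rng),
    ]
    return [make() for make in patterns for _ in range(per_pattern)]


def common_substrings(rows):
    """Longest substrings shared by all rows, as a set (no duplicates)."""
    shortest = min(rows, key=len)
    for size in range(len(shortest), 0, -1):
        candidates = {shortest[i:i + size] for i in range(len(shortest) - size + 1)}
        found = {c for c in candidates if all(c in row for row in rows)}
        if found:
            return found
    return set()


rows = make_rows(random.Random(0))  # fixed seed so the run is repeatable
print(len(rows), {len(row) for row in rows})
print(sorted(common_substrings(rows)))
```

Because "hello" and "match" contain letters that never occur in hexadecimal output, any other common substring of five or more characters would have to come from the random parts of all 100 rows lining up, which is vanishingly unlikely.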

Answer 2

If you are fine with finding the word that occurs in the most rows (a word being anything delimited by spaces), you can use:

select word, count(distinct rn) as num_rows
from(
select unnest(string_to_array(col, ' ')) as word,
       row_number() over(order by col) as rn
from tbl
) x
group by word
order by num_rows desc

Fiddle: http://sqlfiddle.com/#!15/bc803/9/0

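Note that this finds the word APPLE in 4 rows, not 5. That is because APPLE123 is a single word, whereas APPLE 123 would be two words, one of which is APPLE, and that row would then be counted; as it stands, it is not.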
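The per-row word count can be illustrated in Python as well. The sample rows below are an assumption made for this sketch (the fiddle's actual data is not reproduced here); one row contains APPLE123 without a space so that the 4-versus-5 effect shows up. The function name `rows_per_word` is likewise hypothetical.

```python
from collections import Counter


def rows_per_word(rows):
    """Count, for each space-delimited word, the number of rows containing it."""
    counts = Counter()
    for row in rows:
        # set() counts a word once per row, like count(distinct rn) in the query.
        counts.update(set(row.split(" ")))
    return counts


# Assumed sample data: "THE APPLE123" has no space before the digits.
rows = ["1 THE APPLE", "THE APPLE", "THE APPLE123", "THE APPLE 12/16", "BEST THE APPLE"]
print(rows_per_word(rows).most_common(2))  # [('THE', 5), ('APPLE', 4)]
```

Splitting on a single space mirrors string_to_array(col, ' '), so APPLE123 stays one word and does not contribute to the count for APPLE.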

SQL: Find longest common string between rows
