MySQL - How to find the word with the most similar beginning

Time: 2017-10-22 09:55:13

Tags: mysql sql binary-search-tree

How can I find the word in a varchar column of a MySQL database whose beginning is most similar to a specified word?

For example:

+-------------------+
|    word_column    | 
+-------------------+
| StackOferflow     |
| StackExchange     |
| MetaStackExchange |
|       ....        |
+-------------------+

Query: call get_with_similar_begin('StackExch_bla_bla_bla');
Output: 'StackExchange'

Query: call get_with_similar_begin('StackO_bla_bla_bla');
Output: 'StackOferflow'

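Update:

Select * from words where word_column like 'StackExch_bla_bla_bla' does not give the correct result, because 'StackExchange' does not match this filter.

Additional info: there is a BTREE index on word_column, and I would like to use it if possible.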

4 answers:

Answer 0 (score: 2)

In SQL Server, we can use a CTE, as in the query below, to achieve what you want:

declare @search nvarchar(255) = 'StackExch_bla_bla_bla';

-- A cte that contains `StackExch_bla_bla_bla` sub-strings: {`StackExch_bla_bla_bla`, `StackExch_bla_bla_bl`, ...,  `S`}
with cte(part, lvl) as (  
    select @search, 1
    union all 
    select substring(@search, 1, len(@search) - lvl), lvl + 1
    from cte
    where lvl < len(@search)
), t as (   -- Now below cte will find match level of each word_column
    select t.word_column, min(cte.lvl) matchLvl
    from yourTable t
    left join cte
      on t.word_column like cte.part+'%'
    group by t.word_column
)
select top(1) word_column
from t
where matchLvl is not null   -- remove non-matched rows
order by matchLvl;

SQL Server Fiddle Demo

I need more time to find a way to do this in MySQL; hopefully some MySQL experts can answer sooner ;).

My best try in MySQL is:

select tt.word_column
from (
  select t.word_column, min(lvl) matchLvl
  from yourTable t
  join (
    select 'StackExch_bla_bla_bla' part, 1 lvl
    union all select 'StackExch_bla_bla_bl', 2
    union all select 'StackExch_bla_bla_b', 3
    union all select 'StackExch_bla_bla_', 4
    union all select 'StackExch_bla_bla', 5
    union all select 'StackExch_bla_bl', 6
    union all select 'StackExch_bla_b', 7
    union all select 'StackExch_bla_', 8
    union all select 'StackExch_bla', 9
    union all select 'StackExch_bl', 10
    union all select 'StackExch_b', 11
    union all select 'StackExch_', 12
    union all select 'StackExch', 13
    union all select 'StackExc', 14
    union all select 'StackEx', 15
    union all select 'StackE', 16
    union all select 'Stack', 17
    union all select 'Stac', 18
    union all select 'Sta', 19
    union all select 'St', 20
    union all select 'S', 21
  ) p on t.word_column like concat(p.part, '%')
  group by t.word_column
  ) tt
order by matchLvl
limit 1;

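I think that by creating a stored procedure and using a temporary table to store the values of the p sub-select, you can achieve what you want. HTH ;).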

MySQL Fiddle Demo

Answer 1 (score: 2)

A slight variation on @shA.t's answer. The aggregation is not necessary:

select t.*, p.lvl
from yourTable t join
     (select 'StackExch_bla_bla_bla' as part, 1 as lvl union all
      select 'StackExch_bla_bla_bl', 2 union all
      select 'StackExch_bla_bla_b', 3 union all
      select 'StackExch_bla_bla_', 4 union all
      select 'StackExch_bla_bla', 5 union all
      select 'StackExch_bla_bl', 6 union all
      select 'StackExch_bla_b', 7 union all
      select 'StackExch_bla_', 8 union all
      select 'StackExch_bla', 9 union all
      select 'StackExch_bl', 10 union all
      select 'StackExch_b', 11 union all
      select 'StackExch_', 12 union all
      select 'StackExch', 13 union all
      select 'StackExc', 14 union all
      select 'StackEx', 15 union all
      select 'StackE', 16 union all
      select 'Stack', 17 union all
      select 'Stac', 18 union all
      select 'Sta', 19 union all
      select 'St', 20 union all
      select 'S', 21
     ) p
     on t.word_column like concat(p.part, '%')
order by p.lvl
limit 1;

A faster approach is to use case:

select t.*,
       (case when t.word_column like concat('StackExch_bla_bla_bla', '%') then 'StackExch_bla_bla_bla'
             when t.word_column like concat('StackExch_bla_bla_bl', '%') then 'StackExch_bla_bla_bl'
             when t.word_column like concat('StackExch_bla_bla_b', '%') then 'StackExch_bla_bla_b'
             . . .
             when t.word_column like concat('S', '%') then 'S'
             else ''
        end) as longest_match
from t
order by length(longest_match) desc
limit 1;

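None of these will use an index efficiently.

If you want a version that uses the index, do the loop in the application layer and repeatedly run a query like this one, shortening the prefix each time: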

select t.*
from t
where t.word_column like 'StackExch_bla_bla_bla%'
limit 1;

Then stop at the first match. MySQL should use the index for the like comparison.
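
The application-side loop described above might look like the following minimal sketch. It uses Python's built-in sqlite3 module with an in-memory table only so that the example is self-contained; against MySQL you would use your own connector and its placeholder style. The names longest_prefix_match, escape_like and idx_word are assumptions made for illustration, and the ESCAPE clause and the case_sensitive_like pragma are SQLite specifics, not part of the original answer.

```python
import sqlite3

def escape_like(s):
    # Escape the LIKE wildcards so that they match literally.
    return s.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

def longest_prefix_match(conn, search):
    # Try the longest prefix first and shorten it one character at a time.
    for length in range(len(search), 0, -1):
        pattern = escape_like(search[:length]) + "%"
        row = conn.execute(
            "SELECT word_column FROM words "
            "WHERE word_column LIKE ? ESCAPE '\\' LIMIT 1",
            (pattern,),
        ).fetchone()
        if row is not None:
            return row[0]  # stop at the first match
    return None

# Stand-in data set (same sample words as in the question).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")  # SQLite LIKE ignores case by default
conn.execute("CREATE TABLE words (word_column TEXT)")
conn.execute("CREATE INDEX idx_word ON words (word_column)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("StackOferflow",), ("StackExchange",), ("MetaStackExchange",)],
)

print(longest_prefix_match(conn, "StackExch_bla_bla_bla"))  # StackExchange
print(longest_prefix_match(conn, "StackO_bla_bla_bla"))     # StackOferflow
```

In the worst case this issues one query per character of the search string, but each query is a cheap prefix lookup.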

You can approximate this using union all:

(select t.*, 'StackExch_bla_bla_bla' as matching
 from t
 where t.word_column like 'StackExch_bla_bla_bla%'
 limit 1
) union all
(select t.*, 'StackExch_bla_bla_bl'
 from t
 where t.word_column like 'StackExch_bla_bla_bl%'
 limit 1
) union all
(select t.*, 'StackExch_bla_bla_b'
 from t
 where t.word_column like 'StackExch_bla_bla_b%'
 limit 1
) union all
. . .
(select t.*, 'S'
 from t
 where t.word_column like 'S%'
 limit 1
)
order by length(matching) desc
limit 1;

Answer 2 (score: 2)

Create the table and insert the data.

CREATE DATABASE IF NOT EXISTS stackoverflow;
USE stackoverflow;

DROP TABLE IF EXISTS word;
CREATE TABLE IF NOT EXISTS word(
      word_column VARCHAR(255)
    , KEY(word_column)
)
;

INSERT INTO word
    (`word_column`)
VALUES
    ('StackOverflow'),
    ('StackExchange'),
    ('MetaStackExchange')
;

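This solution relies on generating a list of numbers, which we can do with the query below. It generates the numbers from 1 to 1000, so that the final query supports search strings of up to 1000 characters.

Query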

SELECT 
 @row := @row + 1 AS ROW
FROM (
  SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) 
 row1
CROSS JOIN (
  SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row2
CROSS JOIN (
  SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row3
CROSS JOIN (
  SELECT @row := 0
) AS init_user_param;

Result

  row  
--------
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
     ...
     ...
     990
     991
     992
     993
     994
     995
     996
     997
     998
     999
    1000

Now we use the last query as a derived table, combined with DISTINCT SUBSTRING('StackExch_bla_bla_bla', 1, [number]), to find the unique list of prefixes.

Query

SELECT 
 DISTINCT  
   SUBSTRING('StackExch_bla_bla_bla', 1, rows.row) AS word
FROM (

  SELECT 
   @row := @row + 1 AS ROW
  FROM (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) 
   row1
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row2
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row3
  CROSS JOIN (
    SELECT @row := 0
  ) AS init_user_param
) rows

Result

word                   
-----------------------
S                      
St                     
Sta                    
Stac                   
Stack                  
StackE                 
StackEx                
StackExc               
StackExch              
StackExch_             
StackExch_b            
StackExch_bl           
StackExch_bla          
StackExch_bla_         
StackExch_bla_b        
StackExch_bla_bl       
StackExch_bla_bla      
StackExch_bla_bla_     
StackExch_bla_bla_b    
StackExch_bla_bla_bl   
StackExch_bla_bla_bla  

Now we join with the word table and use REPLACE(word_column, word, '') and CHAR_LENGTH(REPLACE(word_column, word, '')) to generate a list.

Query

SELECT 
 *
 , REPLACE(word_column, word, '') AS replaced
 , CHAR_LENGTH(REPLACE(word_column, word, '')) chars_afterreplace
FROM (
 SELECT 
   DISTINCT  
     SUBSTRING('StackExch_bla_bla_bla', 1, rows.row_number) AS word
  FROM (

    SELECT 
     @row := @row + 1 AS row_number
    FROM (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) 
     row1
    CROSS JOIN (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row2
    CROSS JOIN (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row3
    CROSS JOIN (
      SELECT @row := 0
    ) AS init_user_param
  ) rows
) words
INNER JOIN
  word
ON
 word.word_column LIKE CONCAT(words.word, '%');

Result

word        word_column    replaced       chars_afterreplace  
----------  -------------  -------------  --------------------
S           StackExchange  tackExchange                     12
S           StackOverflow  tackOverflow                     12
St          StackExchange  ackExchange                      11
St          StackOverflow  ackOverflow                      11
Sta         StackExchange  ckExchange                       10
Sta         StackOverflow  ckOverflow                       10
Stac        StackExchange  kExchange                         9
Stac        StackOverflow  kOverflow                         9
Stack       StackExchange  Exchange                          8
Stack       StackOverflow  Overflow                          8
StackE      StackExchange  xchange                           7
StackEx     StackExchange  change                            6
StackExc    StackExchange  hange                             5
StackExch   StackExchange  ange                              4
StackExch_  StackExchange  StackExchange                    13

Now we can clearly see that we want the word with the lowest chars_afterreplace, so we add ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC LIMIT 1.

Query

SELECT 
 word.word_column
FROM (
 SELECT 
   DISTINCT  
     SUBSTRING('StackExch_bla_bla_bla', 1, rows.row_number) AS word
FROM (

  SELECT 
    @row := @row + 1 AS row_number
  FROM (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) 
   row1
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row2
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row3
  CROSS JOIN (
    SELECT @row := 0
  ) AS init_user_param
) rows

) words
INNER JOIN word
ON word.word_column LIKE CONCAT(words.word, '%')
ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC
LIMIT 1;

Result

word_column    
---------------
StackExchange  
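
As a cross-check for queries like the one above, the ranking can be sketched in plain Python. Note that this sketch ranks by the length of the shared prefix, not by the number of characters left after REPLACE. The two orderings agree on the sample data but are not the same metric in general: a short stored word such as 'Sta' would leave zero characters after REPLACE and so sort first in the query. The function names common_prefix_len and best_match are illustrative only.

```python
def common_prefix_len(a, b):
    # Count how many leading characters the two strings share.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_match(words, search):
    # Brute force: pick the word sharing the longest prefix with the search string.
    best = max(words, key=lambda w: common_prefix_len(w, search), default=None)
    if best is None or common_prefix_len(best, search) == 0:
        return None
    return best

words = ["StackOverflow", "StackExchange", "MetaStackExchange"]
print(best_match(words, "StackExch_bla_bla_bla"))            # StackExchange
print(best_match(words + ["Sta"], "StackExch_bla_bla_bla"))  # StackExchange
```

This scans every word, so it is only useful for verifying results on small data, not as a replacement for an indexed lookup.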

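Answer 3 (score: 0)

The following solution requires a table containing sequence numbers from 1 up to (at least) the length of word_column. Assuming word_column is VARCHAR(190), you need a table with the numbers 1 to 190. If you use MariaDB with the sequence plugin, you can use the table seq_1_to_190. If you don't have it, there are many ways to create it. One simple way is to use the information_schema.columns table: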

create table if not exists seq_1_to_190 (seq tinyint unsigned auto_increment primary key)
    select null as seq from information_schema.columns limit 190;

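You can also create it on the fly in a subquery, but that would complicate your queries.

I will use the session variable @word to store the search string.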

set @word = 'StackExch_bla_bla_bla';

But you can replace all of its occurrences with a constant search string.

Now we can use the sequence table to create all prefix substrings:

select seq as l, left(@word, seq) as substr
from seq_1_to_190 s
where s.seq <= char_length(@word)

http://rextester.com/BWU18001

and use it in the LIKE condition when joining with the words table:

select w.word_column
from (
    select seq as l, left(@word, seq) as substr
    from seq_1_to_190 s
    where s.seq <= char_length(@word)
) s
join words w on w.word_column like concat(replace(s.substr, '_', '\_'), '%')
order by s.l desc
limit 1

http://rextester.com/STQP82942

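Note that _ is a placeholder (it matches any single character in a LIKE pattern), so you need to escape it as \_ in the search string. You also need to do this for % if your string can contain it, but I will skip that part in this answer.

The query can also be written without the subquery: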

select w.word_column
from seq_1_to_190 s
join words w on w.word_column like concat(replace(left(@word, seq), '_', '\_'), '%')
where s.seq <= char_length(@word)
order by s.seq desc
limit 1

http://rextester.com/QVZI59071

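These queries do the job, and in theory they should also be fast. But MySQL (in my case MariaDB 10.0.19) creates a bad execution plan and does not use the index for the ORDER BY clause. Both queries run in about 1.8 seconds on a 100K-row data set.

I was able to improve performance with this query, which uses a subquery in the SELECT list: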

select (
    select word_column
    from words w
    where w.word_column like concat(replace(left(@word, s.seq), '_', '\_'), '%')
    limit 1
) as word_column
from seq_1_to_190 s
where s.seq <= char_length(@word)
having word_column is not null
order by s.seq desc
limit 1

http://rextester.com/APZHA8471

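This one is faster, but still needs 670 ms. Note that Gordon's CASE query runs in 125 ms, although it needs a full table/index scan and a filesort.

However, I managed to force the engine to use the index for the ORDER BY clause by using an indexed temporary table: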

drop temporary table if exists tmp;
create temporary table tmp(
    id tinyint unsigned auto_increment primary key,
    pattern varchar(190)
) engine=memory
    select null as id, left(@word, seq) as pattern
    from seq_1_to_190 s
    where s.seq <= char_length(@word)
    order by s.seq desc;

select w.word_column
from tmp force index for order by (primary)
join words w 
    on  w.word_column >= tmp.pattern
    and w.word_column <  concat(tmp.pattern, char(127))
order by tmp.id asc
limit 1

http://rextester.com/OOE82089

This query is "instant" (less than 1 ms) on my 100K-row test table. If I remove FORCE INDEX or use a LIKE condition, it becomes slow again.

Note that char(127) seems to work for ASCII strings. You may need to find another character depending on your character set.
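
To see why the range condition works, here is a small sketch that emulates it on a sorted Python list, which stands in for the index on word_column. It assumes ASCII data, and that Python's code-point string ordering is a fair analogue of a binary collation. The function names first_with_prefix and longest_prefix_by_range are made up for this illustration.

```python
from bisect import bisect_left

def first_with_prefix(sorted_words, prefix):
    # Emulates: word_column >= prefix AND word_column < concat(prefix, char(127))
    i = bisect_left(sorted_words, prefix)  # first entry >= prefix
    if i < len(sorted_words) and sorted_words[i] < prefix + chr(127):
        return sorted_words[i]
    return None

def longest_prefix_by_range(sorted_words, search):
    # Probe from the longest prefix down to a single character.
    for length in range(len(search), 0, -1):
        hit = first_with_prefix(sorted_words, search[:length])
        if hit is not None:
            return hit
    return None

words = sorted(["StackOferflow", "StackExchange", "MetaStackExchange"])
print(longest_prefix_by_range(words, "StackExch_bla_bla_bla"))  # StackExchange
print(longest_prefix_by_range(words, "StackO_bla_bla_bla"))     # StackOferflow
```

A side effect of comparing ranges instead of using LIKE is that _ and % are treated as ordinary characters, so no escaping is needed.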

After all that, I must say that my first idea was to use a UNION ALL query, which was also suggested by Gordon Linoff. But here is a SQL-only solution:

set @subquery = '(
    select word_column
    from words
    where word_column like {pattern}
    limit 1
)';

set session group_concat_max_len = 1000000;
set @sql = (
    select group_concat(
        replace(
            @subquery,
            '{pattern}',
            replace(quote(concat(left(@word, seq), '%')), '_', '\_')
        )
        order by s.seq desc
        separator ' union all '
    )
    from seq_1_to_190 s
    where s.seq <= char_length(@word)
);
set @sql = concat(@sql, ' limit 1');

prepare stmt from @sql;
execute stmt;

http://rextester.com/OPTJ37873

It is also "instant".

If you like stored procedures/functions, here is a function:

create function get_with_similar_begin(search_str text) returns text
begin
    declare l integer;
    declare res text;
    declare pattern text;

    set l = char_length(search_str);
    while l > 0 and res is null do
        set pattern = left(search_str, l);
        set pattern = replace(pattern, '_', '\_');
        set pattern = replace(pattern, '%', '\%');
        set pattern = concat(pattern, '%');
        set res = (select word_column from words where word_column like pattern limit 1);
        set l = l - 1;
    end while;
    return res;
end

Use it as

select get_with_similar_begin('StackExch_bla_bla_bla');
select get_with_similar_begin('StackO_bla_bla_bla');

http://rextester.com/CJTU4629

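This is probably the fastest way. For long strings, though, a divide-and-conquer algorithm might reduce the average number of lookups. But that might also just be overkill.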
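
One possible shape for such a divide-and-conquer variant is a binary search over the prefix length. It relies on the fact that if a prefix of length k matches some word, every shorter prefix matches too. The sketch below is not from the original answer: it uses a sorted in-memory list in place of the index, and the names find_with_prefix and longest_prefix_bsearch are hypothetical. In SQL, each probe would be a single prefix lookup with LIMIT 1.

```python
from bisect import bisect_left

def find_with_prefix(sorted_words, prefix):
    # The first entry >= prefix starts with prefix if and only if any entry does.
    i = bisect_left(sorted_words, prefix)
    if i < len(sorted_words) and sorted_words[i].startswith(prefix):
        return sorted_words[i]
    return None

def longest_prefix_bsearch(sorted_words, search):
    lo, hi = 0, len(search)  # the answer length lies in [lo, hi]
    best = None
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        hit = find_with_prefix(sorted_words, search[:mid])
        if hit is not None:
            best, lo = hit, mid   # a prefix of length mid matches, try longer
        else:
            hi = mid - 1          # no match, so the answer must be shorter
    return best

words = sorted(["StackOferflow", "StackExchange", "MetaStackExchange"])
print(longest_prefix_bsearch(words, "StackExch_bla_bla_bla"))  # StackExchange
print(longest_prefix_bsearch(words, "StackO_bla_bla_bla"))     # StackOferflow
```

This needs roughly log2(n) probes for a search string of n characters instead of up to n, which only pays off when search strings are long.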

If you want to test your queries on a big table, I used the following code to create my test table (for MariaDB with the sequence plugin):

drop table if exists words;
create table words(
    id mediumint auto_increment primary key,
    word_column varchar(190),
    index(word_column)
);

insert into words(word_column)
    select concat('Stack', rand(1)) as word_column
    from seq_1_to_100000;

insert into words(word_column)values('StackOferflow'),('StackExchange'),('MetaStackExchange');