How do I find the word in a varchar column of a MySQL database whose beginning is most similar to a given word?
For example:
+-------------------+
| word_column       |
+-------------------+
| StackOferflow     |
| StackExchange     |
| MetaStackExchange |
| ....              |
+-------------------+
Query: call get_with_similar_begin('StackExch_bla_bla_bla');
Output: 'StackExchange'
Query: call get_with_similar_begin('StackO_bla_bla_bla');
Output: 'StackOferflow'
Update:
Select * from words where word_column like 'StackExch_bla_bla_bla'
does not give the correct result, because 'StackExchange' does not match this filter.
Additional info: there is a BTREE index on word_column, and I would like to use it if possible.
Answer 0 (score: 2)
In SQL Server, we can use a CTE, as in the query below, to achieve what you want:
declare @search nvarchar(255) = 'StackExch_bla_bla_bla';
-- A cte that contains `StackExch_bla_bla_bla` sub-strings: {`StackExch_bla_bla_bla`, `StackExch_bla_bla_bl`, ..., `S`}
with cte(part, lvl) as (
select @search, 1
union all
select substring(@search, 1, len(@search) - lvl), lvl + 1
from cte
where lvl < len(@search)
), t as ( -- Now below cte will find match level of each word_column
select t.word_column, min(cte.lvl) matchLvl
from yourTable t
left join cte
on t.word_column like cte.part+'%'
group by t.word_column
)
select top(1) word_column
from t
where matchLvl is not null -- remove non-matched rows
order by matchLvl;
I need more time to find a way to do this in MySQL; hopefully a MySQL expert can answer sooner ;).
My best attempt in MySQL is:
select tt.word_column
from (
select t.word_column, min(lvl) matchLvl
from yourTable t
join (
select 'StackExch_bla_bla_bla' part, 1 lvl
union all select 'StackExch_bla_bla_bl', 2
union all select 'StackExch_bla_bla_b', 3
union all select 'StackExch_bla_bla_', 4
union all select 'StackExch_bla_bla', 5
union all select 'StackExch_bla_bl', 6
union all select 'StackExch_bla_b', 7
union all select 'StackExch_bla_', 8
union all select 'StackExch_bla', 9
union all select 'StackExch_bl', 10
union all select 'StackExch_b', 11
union all select 'StackExch_', 12
union all select 'StackExch', 13
union all select 'StackExc', 14
union all select 'StackEx', 15
union all select 'StackE', 16
union all select 'Stack', 17
union all select 'Stac', 18
union all select 'Sta', 19
union all select 'St', 20
union all select 'S', 21
) p on t.word_column like concat(p.part, '%')
group by t.word_column
) tt
order by matchLvl
limit 1;
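I think you can achieve what you want by creating a stored procedure and using a temporary table to store the values of the p subselect. HTH ;).
Answer 1 (score: 2)
A slight variation on @shA.t's answer. The aggregation is not necessary: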
select t.*, p.lvl
from yourTable t join
(select 'StackExch_bla_bla_bla' as part, 1 as lvl union all
select 'StackExch_bla_bla_bl', 2 union all
select 'StackExch_bla_bla_b', 3 union all
select 'StackExch_bla_bla_', 4 union all
select 'StackExch_bla_bla', 5 union all
select 'StackExch_bla_bl', 6 union all
select 'StackExch_bla_b', 7 union all
select 'StackExch_bla_', 8 union all
select 'StackExch_bla', 9 union all
select 'StackExch_bl', 10 union all
select 'StackExch_b', 11 union all
select 'StackExch_', 12 union all
select 'StackExch', 13 union all
select 'StackExc', 14 union all
select 'StackEx', 15 union all
select 'StackE', 16 union all
select 'Stack', 17 union all
select 'Stac', 18 union all
select 'Sta', 19 union all
select 'St', 20 union all
select 'S', 21
) p
on t.word_column like concat(p.part, '%')
order by p.lvl
limit 1;
A faster approach is to use case:
select t.*,
(case when t.word_column like concat('StackExch_bla_bla_bla', '%') then 'StackExch_bla_bla_bla'
when t.word_column like concat('StackExch_bla_bla_bl', '%') then 'StackExch_bla_bla_bl'
when t.word_column like concat('StackExch_bla_bla_b', '%') then 'StackExch_bla_bla_b'
. . .
when t.word_column like concat('S', '%') then 'S'
else ''
end) as longest_match
from t
order by length(longest_match) desc
limit 1;
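None of these will use the index effectively.
If you want a version that uses the index, loop in the application layer and run this query repeatedly, shortening the prefix by one character each time: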
select t.*
from t
where t.word_column like 'StackExch_bla_bla_bla%'
limit 1;
Then stop at the first match. MySQL should use the index for the like comparison.
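The application-side loop could be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for a MySQL connection, the table name words follows the question, and the helper names escape_like and longest_prefix_match are made up for this sketch. The _ and % characters are escaped so that they match literally.

```python
import sqlite3

def escape_like(text):
    # Escape the escape character itself first, then the LIKE wildcards.
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

def longest_prefix_match(conn, search):
    # Try the longest prefix first and stop at the first hit.
    for length in range(len(search), 0, -1):
        pattern = escape_like(search[:length]) + "%"
        row = conn.execute(
            "SELECT word_column FROM words "
            "WHERE word_column LIKE ? ESCAPE '\\' LIMIT 1",
            (pattern,),
        ).fetchone()
        if row is not None:
            return row[0]
    return None

# In-memory stand-in for the table from the question (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word_column TEXT)")
conn.execute("CREATE INDEX idx_word_column ON words (word_column)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("StackOferflow",), ("StackExchange",), ("MetaStackExchange",)],
)

print(longest_prefix_match(conn, "StackExch_bla_bla_bla"))
print(longest_prefix_match(conn, "StackO_bla_bla_bla"))
```

In the worst case this issues one query per character of the search string, but each query is a simple prefix lookup.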
You can also do this in a single statement with union all:
(select t.*, 'StackExch_bla_bla_bla' as matching
from t
where t.word_column like 'StackExch_bla_bla_bla%'
limit 1
) union all
(select t.*, 'StackExch_bla_bla_bl'
from t
where t.word_column like 'StackExch_bla_bla_bl%'
limit 1
) union all
(select t.*, 'StackExch_bla_bla_b'
from t
where t.word_column like 'StackExch_bla_bla_b%'
limit 1
) union all
. . .
(select t.*, 'S'
from t
where t.word_column like 'S%'
limit 1
)
order by length(matching) desc
limit 1;
Answer 2 (score: 2)
Create the table and insert the data.
CREATE DATABASE IF NOT EXISTS stackoverflow;
USE stackoverflow;
DROP TABLE IF EXISTS word;
CREATE TABLE IF NOT EXISTS word(
word_column VARCHAR(255)
, KEY(word_column)
)
;
INSERT INTO word
(`word_column`)
VALUES
('StackOverflow'),
('StackExchange'),
('MetaStackExchange')
;
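This solution depends on generating a list of numbers, which we can do with the query below. It generates the numbers from 1 to 1000, so the final query supports search strings of up to 1000 characters.
Query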
SELECT
@row := @row + 1 AS ROW
FROM (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
)
row1
CROSS JOIN (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row2
CROSS JOIN (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row3
CROSS JOIN (
SELECT @row := 0
) AS init_user_param
Result
row
--------
1
2
3
4
5
6
7
8
9
10
...
...
990
991
992
993
994
995
996
997
998
999
1000
Now we use the previous query as a derived table, combined with DISTINCT SUBSTRING('StackExch_bla_bla_bla', 1, [number]), to get the list of unique prefixes.
Query
SELECT
DISTINCT
SUBSTRING('StackExch_bla_bla_bla', 1, rows.row) AS word
FROM (
SELECT
@row := @row + 1 AS ROW
FROM (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
)
row1
CROSS JOIN (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row2
CROSS JOIN (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row3
CROSS JOIN (
SELECT @row := 0
) AS init_user_param
) ROWS
Result
word
-----------------------
S
St
Sta
Stac
Stack
StackE
StackEx
StackExc
StackExch
StackExch_
StackExch_b
StackExch_bl
StackExch_bla
StackExch_bla_
StackExch_bla_b
StackExch_bla_bl
StackExch_bla_bla
StackExch_bla_bla_
StackExch_bla_bla_b
StackExch_bla_bla_bl
StackExch_bla_bla_bla
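For comparison, the same prefix list can be produced outside the database. A minimal Python sketch (the function name prefixes is hypothetical and not part of the answer):

```python
def prefixes(search):
    # Every leading substring of the search string, shortest first.
    return [search[:n] for n in range(1, len(search) + 1)]

for prefix in prefixes("StackExch_bla_bla_bla"):
    print(prefix)
```

The search string has 21 characters, so the list has 21 entries, matching the result above.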
Now we join this list to the table and use REPLACE(word_column, word, '') and CHAR_LENGTH(REPLACE(word_column, word, '')) to generate a list.
Query
SELECT
*
, REPLACE(word_column, word, '') AS replaced
, CHAR_LENGTH(REPLACE(word_column, word, '')) chars_afterreplace
FROM (
SELECT
DISTINCT
SUBSTRING('StackExch_bla_bla_bla', 1, rows.row_number) AS word
FROM (
SELECT
@row := @row + 1 AS row_number
FROM (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
)
row1
CROSS JOIN (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row2
CROSS JOIN (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row3
CROSS JOIN (
SELECT @row := 0
) AS init_user_param
) ROWS
) words
INNER JOIN
word
ON
word.word_column LIKE CONCAT(words.word, '%')
Result
word        word_column    replaced       chars_afterreplace
----------  -------------  -------------  ------------------
S           StackExchange  tackExchange   12
S           StackOverflow  tackOverflow   12
St          StackExchange  ackExchange    11
St          StackOverflow  ackOverflow    11
Sta         StackExchange  ckExchange     10
Sta         StackOverflow  ckOverflow     10
Stac        StackExchange  kExchange      9
Stac        StackOverflow  kOverflow      9
Stack       StackExchange  Exchange       8
Stack       StackOverflow  Overflow       8
StackE      StackExchange  xchange        7
StackEx     StackExchange  change         6
StackExc    StackExchange  hange          5
StackExch   StackExchange  ange           4
StackExch_  StackExchange  StackExchange  13
Now we can clearly see that we want the word with the lowest chars_afterreplace.
So we want ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC LIMIT 1
Query
SELECT
word.word_column
FROM (
SELECT
DISTINCT
SUBSTRING('StackExch_bla_bla_bla', 1, rows.row_number) AS word
FROM (
SELECT
@row := @row + 1 AS row_number
FROM (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
)
row1
CROSS JOIN (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row2
CROSS JOIN (
SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row3
CROSS JOIN (
SELECT @row := 0
) AS init_user_param
) ROWS
) words
INNER JOIN word
ON word.word_column LIKE CONCAT(words.word, '%')
ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC
LIMIT 1
Result
word_column
---------------
StackExchange
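The ranking idea (pair each prefix with the words it matches, then keep the pair with the fewest characters left over) can be sketched in Python as well. This is an illustration only; best_match is a hypothetical name, and the sketch uses a literal startswith comparison, so unlike LIKE it does not treat _ as a wildcard.

```python
def best_match(words, search):
    candidates = []
    for n in range(1, len(search) + 1):
        prefix = search[:n]
        for word in words:
            if word.startswith(prefix):
                # Same measure as CHAR_LENGTH(REPLACE(word_column, word, '')).
                remaining = len(word.replace(prefix, ""))
                candidates.append((remaining, word))
    # The smallest remainder wins; ties fall back to alphabetical order.
    return min(candidates)[1] if candidates else None

words = ["StackOverflow", "StackExchange", "MetaStackExchange"]
print(best_match(words, "StackExch_bla_bla_bla"))
```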
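Answer 3 (score: 0)
The following solution requires a table containing sequence numbers from 1 up to (at least) the length of word_column. Assuming word_column is VARCHAR(190), you need a table containing the numbers 1 to 190. If you use MariaDB with the sequence plugin, you can use the table seq_1_to_190. If you don't have it, there are many ways to create it. A simple way is to use the information_schema.columns table: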
create table if not exists seq_1_to_190 (seq tinyint unsigned auto_increment primary key)
select null as seq from information_schema.columns limit 190;
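You could also create it on the fly in a subquery, but that would complicate your queries.
I will use the session variable @word to store the search string.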
set @word = 'StackExch_bla_bla_bla';
But you can replace all occurrences of it with a constant search string.
Now we can use the sequence table to create all prefix substrings with
select seq as l, left(@word, seq) as substr
from seq_1_to_190 s
where s.seq <= char_length(@word)
and use it in the LIKE condition when joining with the words table:
select w.word_column
from (
select seq as l, left(@word, seq) as substr
from seq_1_to_190 s
where s.seq <= char_length(@word)
) s
join words w on w.word_column like concat(replace(s.substr, '_', '\_'), '%')
order by s.l desc
limit 1
http://rextester.com/STQP82942
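Note that _ is a wildcard in LIKE patterns, so you need to escape it as \_ in the search string. You would also need to do this for % if your string can contain it, but I will skip that part in this answer.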
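The effect of the escaping can be checked with a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; SQLite needs an explicit ESCAPE clause, whereas MySQL treats the backslash as the escape character by default). The helper name count_matches is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word_column TEXT)")
conn.execute("INSERT INTO words VALUES ('StackExchange')")

def count_matches(pattern):
    # ESCAPE '\' makes the backslash the escape character for this LIKE.
    sql = "SELECT COUNT(*) FROM words WHERE word_column LIKE ? ESCAPE '\\'"
    return conn.execute(sql, (pattern,)).fetchone()[0]

# Unescaped: "_" is a wildcard and matches the "a" in StackExchange.
unescaped = count_matches("StackExch_%")
# Escaped: "\_" only matches a literal underscore, which the word lacks.
escaped = count_matches("StackExch\\_%")
print(unescaped, escaped)
```

Without the escape, the prefix StackExch_ would count as a match for StackExchange and could be ranked as a longer match than it really is.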
The query can also be written without a subquery:
select w.word_column
from seq_1_to_190 s
join words w on w.word_column like concat(replace(left(@word, seq), '_', '\_'), '%')
where s.seq <= char_length(@word)
order by s.seq desc
limit 1
http://rextester.com/QVZI59071
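These queries do the job, and in theory they should also be fast. But MySQL (MariaDB 10.0.19 in my case) creates a bad execution plan and does not use an index for the ORDER BY clause. Both queries run for about 1.8 seconds on a 100K-row data set.
I was able to improve the performance with a single query: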
select (
select word_column
from words w
where w.word_column like concat(replace(left(@word, s.seq), '_', '\_'), '%')
limit 1
) as word_column
from seq_1_to_190 s
where s.seq <= char_length(@word)
having word_column is not null
order by s.seq desc
limit 1
http://rextester.com/APZHA8471
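This one is faster, but it still needs 670 ms. Note that Gordon's CASE query runs in 125 ms, although it needs a full table/index scan and a filesort.
However, I managed to force the engine to use an index for the ORDER BY clause by using an indexed temporary table: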
drop temporary table if exists tmp;
create temporary table tmp(
id tinyint unsigned auto_increment primary key,
pattern varchar(190)
) engine=memory
select null as id, left(@word, seq) as pattern
from seq_1_to_190 s
where s.seq <= char_length(@word)
order by s.seq desc;
select w.word_column
from tmp force index for order by (primary)
join words w
on w.word_column >= tmp.pattern
and w.word_column < concat(tmp.pattern, char(127))
order by tmp.id asc
limit 1
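This query is "instant" (less than 1 ms) on my 100K-row test table. If I remove FORCE INDEX or use a LIKE condition, it becomes slow again.
Note that char(127) seems to work for ASCII strings. You might need to find another character depending on your character set.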
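To see why the range condition (>= pattern and < pattern followed by char(127)) is equivalent to a prefix lookup on sorted data, here is a rough Python sketch that uses bisect on a sorted list as a stand-in for the B-tree index. The function names are hypothetical, and Python compares strings by code point, which may differ from the collation of a real MySQL column.

```python
import bisect

def first_with_prefix(sorted_words, prefix, sentinel="\x7f"):
    # All strings starting with `prefix` sort into [prefix, prefix + sentinel),
    # as long as every following character is below the sentinel (ASCII only).
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + sentinel)
    return sorted_words[lo] if lo < hi else None

def longest_prefix_match(sorted_words, search):
    # Longest prefix first, like ORDER BY tmp.id ASC in the query above.
    for length in range(len(search), 0, -1):
        hit = first_with_prefix(sorted_words, search[:length])
        if hit is not None:
            return hit
    return None

words = sorted(["StackOferflow", "StackExchange", "MetaStackExchange"])
print(longest_prefix_match(words, "StackExch_bla_bla_bla"))
```

Because the comparison is a plain range check, the underscore needs no escaping here, which is also true for the range condition in the SQL query.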
In the end, I must say that my first idea was to use a UNION ALL query, which Gordon Linoff also suggested. But here is a SQL-only solution:
set @subquery = '(
select word_column
from words
where word_column like {pattern}
limit 1
)';
set session group_concat_max_len = 1000000;
set @sql = (
select group_concat(
replace(
@subquery,
'{pattern}',
replace(quote(concat(left(@word, seq), '%')), '_', '\_')
)
order by s.seq desc
separator ' union all '
)
from seq_1_to_190 s
where s.seq <= char_length(@word)
);
set @sql = concat(@sql, ' limit 1');
prepare stmt from @sql;
execute stmt;
http://rextester.com/OPTJ37873
It is also "instant".
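If the statement is assembled in application code instead of with group_concat, the same UNION ALL text could be built roughly like this. It is a sketch only: build_union_all and escape_like are hypothetical helpers, and the %s placeholder style is an assumption about the database driver in use. The statement is built but not executed here.

```python
def escape_like(text):
    # Escape the escape character itself first, then the LIKE wildcards.
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

def build_union_all(search):
    branch = "(select word_column from words where word_column like %s limit 1)"
    # Longest prefix first, so the outer "limit 1" keeps the best match.
    patterns = [escape_like(search[:n]) + "%" for n in range(len(search), 0, -1)]
    sql = " union all ".join([branch] * len(patterns)) + " limit 1"
    return sql, patterns

sql, params = build_union_all("StackO_bla")
print(sql)
print(params)
```

Passing the patterns as parameters avoids quoting them by hand, which the SQL-only version has to do with quote().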
If you prefer stored procedures/functions, here is a function:
create function get_with_similar_begin(search_str text) returns text
begin
declare l integer;
declare res text;
declare pattern text;
set l = char_length(search_str);
while l > 0 and res is null do
set pattern = left(search_str, l);
set pattern = replace(pattern, '_', '\_');
set pattern = replace(pattern, '%', '\%');
set pattern = concat(pattern, '%');
set res = (select word_column from words where word_column like pattern limit 1);
set l = l - 1;
end while;
return res;
end
Use it as
select get_with_similar_begin('StackExch_bla_bla_bla');
select get_with_similar_begin('StackO_bla_bla_bla');
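This is probably the fastest way. For long strings, though, a kind of divide-and-conquer algorithm might reduce the average number of lookups. But that might also just be overkill.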
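One way to read the divide-and-conquer idea is a binary search over the prefix length: if a prefix of length k matches some word, every shorter prefix matches too, so the longest matching length can be found in about log2(n) lookups instead of n. The sketch below is an interpretation, not the answer's own code; it uses a sorted Python list in place of the indexed column, and the function names are hypothetical.

```python
import bisect

def has_prefix(sorted_words, prefix):
    # The first word >= prefix starts with it if and only if any word does.
    i = bisect.bisect_left(sorted_words, prefix)
    return i < len(sorted_words) and sorted_words[i].startswith(prefix)

def longest_matching_length(sorted_words, search):
    lo, hi = 0, len(search)  # the answer always lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_prefix(sorted_words, search[:mid]):
            lo = mid       # a prefix of length mid matches, try longer ones
        else:
            hi = mid - 1   # too long, only shorter prefixes can match
    return lo

words = sorted(["StackOferflow", "StackExchange", "MetaStackExchange"])
print(longest_matching_length(words, "StackExch_bla_bla_bla"))
```

A result of 0 means that not even the first character matches any word.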
If you want to test your queries on a large table, I used the following code to create my test table (for MariaDB with the sequence plugin):
drop table if exists words;
create table words(
id mediumint auto_increment primary key,
word_column varchar(190),
index(word_column)
);
insert into words(word_column)
select concat('Stack', rand(1)) as word_column
from seq_1_to_100000;
insert into words(word_column)values('StackOferflow'),('StackExchange'),('MetaStackExchange');