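I wrote what should have been a simple SQL query, and it turned out not to be so simple. I have a database of 1.2 million words (in several languages), and more. I got curious about how many words I could spell using 5 of the letters in jxtehmrungce, so I decided to test it. Well, it turns out that writing such a query is easy. BUT there must be a simpler solution? The more characters there are, the longer the query gets.
Below, the pattern runs through all of the characters (letters) in alphabetical order: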
SELECT count(DISTINCT `word`) as `numrows` FROM `words` WHERE LENGTH(`word`) = '5' AND `chars` REGEXP ' ([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])+([g{0,1}+]|[i{0,1}+]|[l{0,1}+]|[m{0,1}+]|[n{0,1}+]|[o{0,1}+]|[r{0,1}+]|[t{0,1}+]|[u{0,1}+]|[x{0,1}+])' AND `verified` = '1'
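I would use this on yougowords.com for an unscrambler tool that runs against a 3.9 million row table, if it performed well, but this is a very time-consuming query. How can I improve it? Several regular expressions might do it, but what happens if you change the character set to contain double or triple letters, for example by adding an extra j, g, h, or by adding more letters, such as jjtehhmrungcs?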
Edit: a character may not be reused beyond the number of times it appears in the set, which is why the set contains 2 e's. (jxtehmrungce)
I have no formal SQL experience; everything here is based on my own limited knowledge.
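The chars column: for a different program I created the chars column, which stores all of the letters of each word in alphabetical order. So the word "life" is stored as efil, and the word "happy" becomes ahppy. I can use either the word column or the chars column to get the same result with this query, but the chars column puts the characters in order, so jxtehmrungce becomes ceeghjmnrtux. Could that help with finding words that have no more than "2" e's?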
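To make the idea behind the chars column concrete, here is a minimal Python sketch, written outside the database and not taken from the original post. It assumes that chars is simply the word's letters sorted alphabetically, as described above. Under that assumption a word can be built from the letter pool exactly when its sorted letters form a subsequence of the sorted pool. The function names `signature` and `fits_pool` are made up for illustration.

```python
def signature(word: str) -> str:
    """Return the letters of `word` in alphabetical order (the assumed `chars` value)."""
    return "".join(sorted(word))


def fits_pool(word: str, pool: str) -> bool:
    """True if `word` can be spelled with the letters of `pool`, using each
    pool letter at most as often as it appears there."""
    remaining = iter(signature(pool))
    # `ch in remaining` advances the iterator until it finds `ch`, so the loop
    # checks that the sorted word is a subsequence of the sorted pool.
    return all(ch in remaining for ch in signature(word))


POOL = "jxtehmrungce"

print(signature("life"))          # efil
print(signature(POOL))            # ceeghjmnrtux
print(fits_pool("merge", POOL))   # True: the pool has two e's
print(fits_pool("teeth", POOL))   # False: the pool has only one t
```

The same subsequence test is hard to express as one compact regular expression, which may be why the query in the question grows with every added letter.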
Answer 0 (score: 2)
Does this do what you want?
select count(distinct word)
from words w
where word regexp '[jxtehmrungce]{5}' and verified = '1';
Or are you looking for permutations of five of the characters?
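A quick way to see what the character-class pattern accepts is to try it outside the database. The sketch below uses Python's `re` module as a stand-in for MySQL's REGEXP, which is an assumption: the two engines differ in details, but both treat an unanchored pattern as "matches anywhere in the string". The helper names and sample words are made up for illustration.

```python
import re

PATTERN = re.compile(r"[jxtehmrungce]{5}")


def matches_anywhere(word: str) -> bool:
    """Unanchored match: five consecutive letters from the set anywhere in the word."""
    return PATTERN.search(word) is not None


def matches_whole(word: str) -> bool:
    """Anchored match: the entire word is exactly five letters from the set."""
    return PATTERN.fullmatch(word) is not None


print(matches_anywhere("emergency"))  # True, although the word has nine letters
print(matches_whole("emergency"))     # False
print(matches_whole("merge"))         # True
print(matches_whole("teeth"))         # True, although the set has only one t
```

So the pattern on its own limits neither the word length nor how often a letter is used. Anchoring it, or adding a length check, fixes the first problem; the second one needs a per-letter count.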
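Edit:
If each character may be used only as often as it appears in the list, the query gets more complicated. I would take the approach of generating all possible combinations and then checking whether they are words: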
create view i_c as
select 1 as i, 'j' as c union all
select 2 as i, 'x' as c union all
select 3 as i, 't' as c union all
select 4 as i, 'e' as c union all
select 5 as i, 'h' as c union all
select 6 as i, 'm' as c union all
select 7 as i, 'r' as c union all
select 8 as i, 'u' as c union all
select 9 as i, 'n' as c union all
select 10 as i, 'g' as c union all
select 11 as i, 'c' as c union all
select 12 as i, 'e' as c;
select count(distinct w.word)
from i_c c1 join
i_c c2
on c2.i not in (c1.i) join
i_c c3
on c3.i not in (c1.i, c2.i) join
i_c c4
on c4.i not in (c1.i, c2.i, c3.i) join
i_c c5
on c5.i not in (c1.i, c2.i, c3.i, c4.i) join
words w
on concat(c1.c, c2.c, c3.c, c4.c, c5.c) = w.word
where w.verified = 1;
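For readers who want to check the logic of the joins without a database, the same idea can be sketched in Python. This is only an illustration: `candidate_strings`, `count_words` and the tiny sample dictionary are made up, and `itertools.permutations` plays the role of the five self-joins on `i_c`. Both treat the two e's as separate positions, so 12 * 11 * 10 * 9 * 8 = 95,040 ordered picks are generated, and duplicates collapse afterwards (the query does that with `count(distinct ...)`, the sketch with a set).

```python
from itertools import permutations


def candidate_strings(pool: str, length: int) -> set[str]:
    """All distinct strings made from `length` different positions of `pool`."""
    return {"".join(picked) for picked in permutations(pool, length)}


def count_words(pool: str, length: int, dictionary: set[str]) -> int:
    """Number of dictionary words that appear among the candidate strings."""
    return len(candidate_strings(pool, length) & dictionary)


POOL = "jxtehmrungce"
SAMPLE_WORDS = {"merge", "tuner", "teeth", "hello"}  # hypothetical stand-in for the words table

print(len(list(permutations(POOL, 5))))    # 95040 ordered picks
print(count_words(POOL, 5, SAMPLE_WORDS))  # 2 ("merge" and "tuner")
```

The number of candidates grows quickly with the target length, so this approach suits short words better than long ones.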
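Answer 1 (score: 1)
Building on Gordon's excellent answer above, you can create a temporary table that stores each char together with the maxCount of times it may appear, and then use a NOT EXISTS subquery in the where clause to check that no letter shows up more than maxCount times. I don't have MySQL installed to test it, but the SQL Server version of my query works fine, and I think I have translated all of the syntax to MySQL correctly.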
CREATE TEMPORARY TABLE chars(letter char(1) not null, maxCount int not null);
INSERT INTO chars(letter, maxCount)
VALUES ('j',1),('x',1),('t',1),('e',2),('h',1),('m',1),('r',1),('u',1),('n',1),('g',1),('c',1)
;
select count(distinct word)
from words w
where LENGTH(word) = 5 and word regexp '[jxtehmrungce]{5}' and verified = '1'
and not exists(
select 1
from chars c
-- This checks how many times each character occurs in the word.
-- Ex: 'asdfee' has len = 6; if I replace the e's, it becomes 'asdf' with len = 4, and 6 - 4 = 2
where length(w.word) - length(replace(w.word, c.letter, '')) > c.maxCount
)
;
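The counting trick in the subquery can be checked in isolation. The following Python sketch mirrors the query's logic under a few assumptions: `MAX_COUNT` stands in for the chars temporary table, the membership test stands in for the regexp filter, and the function names are invented for this example.

```python
# Stand-in for the `chars` temporary table: letter -> maxCount.
MAX_COUNT = {"j": 1, "x": 1, "t": 1, "e": 2, "h": 1, "m": 1,
             "r": 1, "u": 1, "n": 1, "g": 1, "c": 1}


def occurrences(word: str, letter: str) -> int:
    """Same arithmetic as length(word) - length(replace(word, letter, ''))."""
    return len(word) - len(word.replace(letter, ""))


def allowed(word: str, max_count: dict[str, int], length: int = 5) -> bool:
    """True if `word` has the target length, uses only listed letters,
    and uses none of them more often than its maxCount."""
    if len(word) != length:
        return False
    if any(ch not in max_count for ch in word):  # role of the regexp filter
        return False
    # Role of NOT EXISTS: no letter may exceed its limit.
    return not any(occurrences(word, letter) > limit
                   for letter, limit in max_count.items())


print(occurrences("asdfee", "e"))   # 2
print(allowed("merge", MAX_COUNT))  # True
print(allowed("teeth", MAX_COUNT))  # False: t appears twice
```

Note that the per-letter check alone says nothing about letters that are missing from the table, which is why the query keeps the regexp filter alongside NOT EXISTS.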
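Here is a SQL Fiddle demo: http://sqlfiddle.com/#!2/b0da4/2
You can also look at using GROUP_CONCAT (https://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_group-concat) to make the regex pattern dynamic. Below is a dynamic example that builds the pattern from the values in the chars temporary table and the value set for the @targetWordLen variable. This makes it easy to add new characters to the list and to change the target word length.
SQL Fiddle demo of the dynamic version: http://sqlfiddle.com/#!2/b0da4/29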
SET @targetWordLen := 5;
set @regExPattern := concat('[',(select group_concat(letter SEPARATOR '') from chars),']{', @targetWordLen, '}');
select count(distinct w.word)
from words w
where LENGTH(word) = @targetWordLen
and w.word regexp @regExPattern
and w.verified = 1
and not exists(
select *
from chars c
where length(w.word) - length(replace(w.word, c.letter, '')) > c.maxCount
)
;
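To see what string the concat/group_concat expression produces, here is a small Python sketch of the same construction. It is an illustration only: `build_pattern` is a made-up helper, the dict stands in for the chars table, and it assumes the letters need no escaping inside a character class (true for plain a to z).

```python
import re

# Stand-in for the `chars` temporary table: letter -> maxCount.
MAX_COUNT = {"j": 1, "x": 1, "t": 1, "e": 2, "h": 1, "m": 1,
             "r": 1, "u": 1, "n": 1, "g": 1, "c": 1}


def build_pattern(max_count: dict[str, int], target_len: int) -> str:
    """Mirror concat('[', group_concat(letter SEPARATOR ''), ']{', len, '}')."""
    letters = "".join(max_count)  # dict keys, in insertion order
    return "[" + letters + "]{" + str(target_len) + "}"


pattern = build_pattern(MAX_COUNT, 5)
print(pattern)                                      # [jxtehmrungc]{5}
print(re.fullmatch(pattern, "merge") is not None)   # True

# Adding a letter or changing the length only changes the inputs.
longer = build_pattern({**MAX_COUNT, "s": 1}, 6)
print(longer)                                       # [jxtehmrungcs]{6}
```

Each letter appears once in the generated class regardless of its maxCount, so the per-letter limit still has to come from the NOT EXISTS check.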