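Given a column of strings (passwords) in MySQL and a value N, I am looking for a SQL way to count the frequency of each n-gram (substring of length n).
It is important to keep the code inside MySQL, because in the other environments I have it causes a memory overflow.
The only working approach I have found so far is to assume the strings have a bounded length (a reasonable assumption), SELECT the substring at each starting position separately, UNION the results, and then GROUP BY and COUNT, like this (for 9-grams in passwords of up to 13 characters):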
SELECT nueve, COUNT(*) AS density, AVG(location) AS avgloc
FROM (
    SELECT MID(pass, 1, 9) AS nueve, 1 AS location FROM passdata WHERE LENGTH(pass) >= 9 AND LENGTH(pass) <= 13
    UNION ALL
    SELECT MID(pass, 2, 9), 2 AS location FROM passdata WHERE LENGTH(pass) >= 10 AND LENGTH(pass) <= 13
    UNION ALL
    SELECT MID(pass, 3, 9), 3 AS location FROM passdata WHERE LENGTH(pass) >= 11 AND LENGTH(pass) <= 13
    UNION ALL
    SELECT MID(pass, 4, 9), 4 AS location FROM passdata WHERE LENGTH(pass) >= 12 AND LENGTH(pass) <= 13
    UNION ALL
    SELECT MID(pass, 5, 9), 5 AS location FROM passdata WHERE LENGTH(pass) = 13
) AS nueves
GROUP BY nueve
ORDER BY density DESC
The results look like this:
nueve density avgloc
123456789 1387 2.4564
234567890 193 2.7306
987654321 141 2.0355
password1 111 1.7748
123123123 92 1.913
liverpool 89 1.618
111111111 86 2.2791
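where nueve is the 9-gram, density is the number of occurrences, and avgloc is the average starting position in the string.
Any suggestions for improving the query? I am doing the same for other n-grams as well.
Thanks!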
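Answer 0 (score: 1)
Create a table that contains all the numbers from 1 to the maximum password length. You can then join against it to get the substring positions.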
SELECT nueve, COUNT(*) AS density, AVG(location) as avgloc
FROM (
SELECT MID(p.pass, n.num, @N) AS nueve, n.num AS location
FROM passdata AS p
JOIN numbers_table AS n ON LENGTH(p.pass) >= (@N + n.num - 1)
) AS x
GROUP BY nueve
ORDER BY density DESC
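The query above relies on a numbers_table and a user variable @N that are not defined in the answer. A minimal setup sketch follows, assuming the helper table has a single integer column named num (as the join implies) and that 13 is the maximum password length, as in the question. The explicit value list is only one way to fill the table.

```sql
-- Helper table of candidate starting positions.
-- Assumption: the table and column names match those used in the query above.
CREATE TABLE numbers_table (
    num INT UNSIGNED NOT NULL PRIMARY KEY
);

-- Fill it with 1..13, assuming 13 is the longest password in passdata.
INSERT INTO numbers_table (num) VALUES
    (1), (2), (3), (4), (5), (6), (7),
    (8), (9), (10), (11), (12), (13);

-- n-gram length read by the query; change it to count other n-grams.
SET @N = 9;
```

With this in place, changing @N is enough to count other n-gram lengths without rewriting the query. One caveat: LENGTH() counts bytes while MID() counts characters, so if pass can hold multi-byte characters, CHAR_LENGTH(p.pass) in the join condition is probably the safer choice.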