MySQL single-column n-gram split and count

Time: 2017-06-26 20:21:38

Tags: mysql query-optimization union n-gram

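Given a column of strings (passwords) in MySQL and a value N, I'm looking for a SQL way to count the frequency of each n-gram (substring of length N).

Keeping the code inside MySQL is important, because in the other environments I have available this causes a memory overflow.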

The only working approach I have found so far is to assume the strings have a bounded length (a reasonable assumption), extract the substring at each starting position with a separate select, union the results, and then group by and count, like this (for 9-grams out of strings of up to 13 characters):

select
    nueve,
    count(*) as density,
    avg(location) as avgloc
from
    (select mid(pass, 1, 9) as nueve, 1 as location
     from passdata
     where length(pass) >= 9 and length(pass) <= 13
     union all
     select mid(pass, 2, 9), 2 as location
     from passdata
     where length(pass) >= 10 and length(pass) <= 13
     union all
     select mid(pass, 3, 9), 3 as location
     from passdata
     where length(pass) >= 11 and length(pass) <= 13
     union all
     select mid(pass, 4, 9), 4 as location
     from passdata
     where length(pass) >= 12 and length(pass) <= 13
     union all
     select mid(pass, 5, 9), 5 as location
     from passdata
     where length(pass) = 13) as nueves
group by nueve
order by density desc

The results look like this:

nueve      density  avgloc
123456789  1387     2.4564
234567890  193      2.7306
987654321  141      2.0355
password1  111      1.7748
123123123  92       1.913
liverpool  89       1.618
111111111  86       2.2791

where nueve is the 9-gram, density is the number of occurrences, and avgloc is the average starting position in the string.

Any suggestions for improving the query? I'm doing the same for other n-grams as well.

Thanks!

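1 answer:

Answer 0: (score: 1)

Create a table that contains all the numbers from 1 to the maximum password length. You can then join against it to get the substring positions.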

SELECT nueve, COUNT(*) AS density, AVG(location) as avgloc
FROM (
    SELECT MID(p.pass, n.num, @N) AS nueve, n.num AS location
    FROM passdata AS p
    JOIN numbers_table AS n ON LENGTH(p.pass) >= (@N + n.num - 1)
) AS x
GROUP BY nueve
ORDER BY density DESC
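
The query above assumes that numbers_table already exists and that @N has been set. A minimal setup sketch follows, assuming a maximum password length of 13 (the bound used in the question) and a column named num as referenced in the join; the column type and primary key are illustrative choices, not something the answer specifies.

```sql
-- Helper table holding the candidate start positions 1..13.
-- Assumption: 13 is the maximum password length, as in the question.
CREATE TABLE numbers_table (
    num INT UNSIGNED NOT NULL PRIMARY KEY
);

INSERT INTO numbers_table (num)
VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12), (13);

-- n-gram length used by the query; 9 reproduces the 9-gram case.
SET @N = 9;
```

With this in place, changing @N should be enough to count other n-gram sizes without rewriting the query. Unlike the UNION version, the join does not exclude passwords longer than 13 characters; adding WHERE LENGTH(p.pass) <= 13 to the inner select would keep that restriction if it is wanted. Also note that LENGTH() counts bytes while MID() counts characters, so CHAR_LENGTH() may be the safer choice if the column can hold multi-byte data.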