How to count matching characters in successive substrings in O(n)

Time: 2014-09-27 14:20:01

Tags: string algorithm pattern-matching

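How can I count, in O(n), the characters that match the original string across its successive substrings? Each substring is formed by deleting one more character from the beginning.

Example: the given string is ababcabab, and the expected result is 8.

  • Substr1: babcabab, count: 0

  • Substr2: abcabab, count: 2, because the first two characters match the original string; the third character does not, so the comparison stops there

  • Substr3: bcabab, count: 0

  • Substr4: cabab, count: 0

  • Substr5: abab, count: 4

  • Substr6: bab, count: 0

  • Substr7: ab, count: 2

  • Substr8: b, count: 0

Expected result: 2 + 4 + 2 = 8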

3 answers:

Answer 0 (score: 4)

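You can build a suffix tree in O(n) with Ukkonen's algorithm and derive the suffix array (and LCP array) from it, then locate the original string in the suffix array with another O(n) pass. Summing the LCP values around the original string then becomes trivial (moving away from the original string in either direction, each suffix contributes the smallest LCP value seen so far, and you can stop once that reaches 0):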

      LCP SA  suffix
      0   9   .
      0   7   ab.
    > 2   5   abab.
    > 4   0   ababcabab.
    > 2   2   abcabab.
      0   8   b.
      1   6   bab.
      3   1   babcabab.
      1   3   bcabab.
      0   4   cabab.
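
As a rough illustration of the summing step, here is a minimal Java sketch. It is not the linear-time construction this answer refers to: the suffix array is built by plain sorting and the LCP array by direct comparison, purely to keep the example short, so only the final walk around the original string is linear. The class and method names (`SuffixArrayMatchSum`, `buildSuffixArray`, `buildLcp`, `sumOfMatches`) are made up for this example.

```java
import java.util.Arrays;

class SuffixArrayMatchSum {
    // ASSUMPTION: naive construction by sorting suffix start positions.
    // A stand-in for a linear-time construction; not O(n).
    static Integer[] buildSuffixArray(String s) {
        Integer[] sa = new Integer[s.length()];
        for (int i = 0; i < sa.length; i++) {
            sa[i] = i;
        }
        Arrays.sort(sa, (a, b) -> s.substring(a).compareTo(s.substring(b)));
        return sa;
    }

    // lcp[k] = length of the common prefix of the suffixes at sa[k - 1] and sa[k];
    // lcp[0] is 0. Computed by direct comparison (also not linear).
    static int[] buildLcp(String s, Integer[] sa) {
        int[] lcp = new int[sa.length];
        for (int k = 1; k < sa.length; k++) {
            int a = sa[k - 1];
            int b = sa[k];
            int len = 0;
            while (a + len < s.length() && b + len < s.length()
                    && s.charAt(a + len) == s.charAt(b + len)) {
                len++;
            }
            lcp[k] = len;
        }
        return lcp;
    }

    static long sumOfMatches(String s) {
        if (s.isEmpty()) {
            return 0;
        }
        Integer[] sa = buildSuffixArray(s);
        int[] lcp = buildLcp(s, sa);

        // Locate the original string (suffix 0) in the suffix array.
        int pos = 0;
        while (sa[pos] != 0) {
            pos++;
        }

        long total = 0;
        // Walk upwards: each suffix contributes the minimum LCP seen so far.
        int min = Integer.MAX_VALUE;
        for (int k = pos; k >= 1; k--) {
            min = Math.min(min, lcp[k]);
            if (min == 0) {
                break;
            }
            total += min;
        }
        // Walk downwards in the same way.
        min = Integer.MAX_VALUE;
        for (int k = pos + 1; k < sa.length; k++) {
            min = Math.min(min, lcp[k]);
            if (min == 0) {
                break;
            }
            total += min;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfMatches("ababcabab")); // prints 8
    }
}
```

For ababcabab the upward walk adds 4 and then 2, and the downward walk adds 2, which gives the expected 8.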

Answer 1 (score: 0)

Use for loops (Java in this example):

    String s = "ababcabab";
    int count = 0;
    // Loop over all substrings. [EDIT]: starts with 1 instead of 0. Thanks to vincent.
    // Note: this is O(n^2) in the worst case (e.g. a string of identical characters).
    for (int i = 1; i < s.length(); i++) {
        String sub = s.substring(i);
        // For and while loops in Java are very similar: this one stops when the
        // substring doesn't match anymore OR the substring's end is reached.
        for (int j = 0; j < sub.length() && sub.charAt(j) == s.charAt(j); j++) {
            count++; // increases count for every matching char in a row in the substring
        }
    }
    System.out.println("The count is: " + count);

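Answer 2 (score: 0)

We can solve this in O(n) by drawing a few logical conclusions. All the matches are against the same thing, namely the string itself, so any match starting at index i of the string will contain all the matches that started before i (or as much of them as its length allows). In addition, any match whose length is greater than its starting index consists of the beginning portion of the string, up to the match start, repeated. We only need to record in full the matches we can find in a single traversal of the string without backing up, and deduce the rest.

Examples (indices are 1-based, not zero-based):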

"aaaaaa":
Starting on index 2, we have a match length 5. This match necessarily includes
a match of length 4 starting on index 3 (since index 3 is index 2 for the
substring that starts on index 2). Continuing the same logic, we add 3 + 2 + 1
for a total of 15, without needing to scan and compare more than Substr2.

"aabaabaa":
Starting on index 2, we have a match length 1.
Starting on index 4, we have a match length 5. This match necessarily includes
a match of length 1 starting on index 5 (since index 5 is index 2 for the
substring that starts on index 4). It also necessarily includes a match of 
length (5 - 3) starting on index 7 (since index 7 is index 4 for the substring
that starts on index 4), and this match implies another match of length 1, 
starting on index 8. Altogether 1 + 5 + 1 + (5 - 3) + 1 = 10. Again, the scan
was O(n).

"aabaabaabaabaa":
Starting on index 2, we have a match length 1.
Starting on index 4, we have a match length 11.
1 + 11 + 1 + (11 - 3) + 1 + (8 - 3) + 1 + (5 - 3) + 1 = 31.

"aabaaab":
Starting on index 2, we have a match length 1.
For repeated patterns in the beginning of the string, we can use a formula 
rather than multiple scans, so a string like "aabaaaaaaaaaab" would have the 
same complexity as the one above, (number of times the pattern repeats - number
of times the pattern repeats in the beginning of the string) * total length of
repeated pattern at the start of the string. We identify a pattern if the 
length of the first match is a multiple of its starting index. Identifying 
this pattern and using the formula also prevents erroneously missing the 
correct match to record (e.g., without it, we would have identified 'aa' and 
'a' at the end as matches and missed the 'aab'). 
So starting on index 4, we have (3 - 2) * 2 = 2
Starting on index 5, we have a match length 3.
1 + 2 + 3 + 1 = 7

"ababcabab":
Starting on index 3, we have a match length 2.
Starting on index 6, we have a match length 4.
2 + 4 + 2 = 8
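
The idea of reusing earlier matches instead of rescanning is what the standard Z-algorithm does, so here is a minimal Java sketch of it. This is a swapped-in, well-known technique, not necessarily the exact bookkeeping described in this answer, and the names `PrefixMatchSum`, `zFunction`, and `sumOfMatches` are made up for the example. Indices in the code are zero-based, unlike the examples above.

```java
class PrefixMatchSum {
    // z[i] = length of the longest common prefix of s and s.substring(i);
    // z[0] is left at 0 so the full string itself is not counted.
    static int[] zFunction(String s) {
        int n = s.length();
        int[] z = new int[n];
        // [l, r) is the rightmost window found so far that matches a prefix of s.
        int l = 0;
        int r = 0;
        for (int i = 1; i < n; i++) {
            if (i < r) {
                // Position i lies inside a known match, so reuse the value
                // already computed for the corresponding prefix position.
                z[i] = Math.min(r - i, z[i - l]);
            }
            // Extend by direct comparison only beyond what is already known.
            while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
                z[i]++;
            }
            if (i + z[i] > r) {
                l = i;
                r = i + z[i];
            }
        }
        return z;
    }

    static long sumOfMatches(String s) {
        long total = 0;
        for (int len : zFunction(s)) {
            total += len;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfMatches("ababcabab")); // prints 8
    }
}
```

The right end of the window never moves backwards, so the character comparisons add up to O(n) overall. On the five example strings above, this sketch gives 15, 10, 31, 7, and 8.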