Finding patterns in a binary string

Time: 2013-05-10 03:50:31

Tags: javascript algorithm pattern-matching

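I'm trying to find repeating patterns in a string of binary digits.

E.g. 0010010010 or 1110111011 = ok

Not: 0100101101 = bad

The strings are 10 digits long (as above), and I guess 2 iterations of the 'pattern' are the minimum.

I started by manually setting up a 'bank' of patterns that the program could match against, but there must be a better way to do this with an algorithm?

Searching got me nowhere - I think the language and terminology I'm using are incorrect.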

7 answers:

Answer 0: (score: 2)

Quite a challenge. How about this function?

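// Returns the shortest prefix that, repeated, reproduces n (a trailing
// partial repetition is allowed), or false if there is none.
function findPattern(n) {
    var maxlen = Math.floor(n.length/2);
    NEXT:
    for(var i=1; i<=maxlen; ++i) {          // i = candidate pattern length
        var len=0, k=0, prev="", sub;
        do {
            sub = n.substring(k,k+i);       // next chunk of up to i characters
            k+= i;
            len = sub.length;
            if(len!=i) break;               // short (or empty) trailing chunk
            if(prev.length && prev!=sub) continue NEXT;  // mismatch: try next length
            if(!prev.length) prev = sub;
        } while(sub.length);
        // a trailing partial chunk must be a prefix of the pattern
        var trail = n.substr(n.length-len);
        if(!len || trail==n.substr(0,len)) return n.substr(0,i);
    }
    return false;
}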

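It even works for strings of any length and any content. See the fiddle.

Inspired by Jack's and Zim-Zam's answers, here is the pattern list for a brute-force algorithm: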

var oksubs =
["001","010","011","100","101","110",
"0001","0010","0011","0100","0101","0110","0111",
"1000","1001","1010","1011","1100","1101","1110",
"00000","00001","00011","00101","00110","00111","01000",
"01001","01010","01011","01100","01101","01110","01111",
"10000","10001","10011","10101","10110","10111","11000","11001",
"11010","11011","11100","11101","11110","11111"];

Thanks to Jack's comment, here is code that is both short and effective:

function findPattern(n) {
    var oksubs = [n.substr(0,5),n.substr(0,4),n.substr(0,3)];
    for(var i=0; i<oksubs.length; ++i) {
        var sub = oksubs[i];
        if((sub+sub+sub+sub).substr(0,10)==n) return sub;
    }
    return false;
}

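Answer 1: (score: 1)

You only have 2^10 patterns, which is a small enough number that you can precompute all the valid strings and store the results in a 1024-element boolean array; if a string is valid, convert it to an integer (e.g. "0000001111" = 15) and store "true" at that index of the result array. To check whether a string is valid, convert it to an integer and look up that index in the precomputed boolean array.

If your strings were longer than 10 digits you would need to be smarter about determining whether a string is valid, but since you only have 1024 strings you might as well be lazy about it.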
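A minimal sketch of that lookup table, assuming "valid" means that some prefix of length 1 to 5, repeated and cut to 10 characters, reproduces the string. The names `isRepeating`, `validTable` and `isValid` are made up for illustration and do not come from the answer:

```javascript
var LENGTH = 10; // string length from the question

// Assumption: a string is valid when some prefix of length 1..LENGTH/2,
// repeated and cut to LENGTH characters, reproduces the whole string.
function isRepeating(s) {
    for (var p = 1; p <= Math.floor(s.length / 2); p++) {
        var repeated = s.slice(0, p).repeat(Math.ceil(s.length / p)).slice(0, s.length);
        if (repeated === s) return true;
    }
    return false;
}

// Precompute one boolean per 10-bit value; index 15 stands for "0000001111".
var validTable = [];
for (var value = 0; value < Math.pow(2, LENGTH); value++) {
    validTable.push(isRepeating(value.toString(2).padStart(LENGTH, "0")));
}

// Lookup: parse the binary string as an integer and index the table.
function isValid(s) {
    return validTable[parseInt(s, 2)];
}

console.log(isValid("0010010010")); // true
console.log(isValid("0100101101")); // false
```

The table only answers yes or no; if the pattern itself is needed, the array could store the pattern string instead of a boolean.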

Answer 2: (score: 1)

My brute-force approach would be:

By example

  1. givenString: 0010010010

  2. Create the list of possible patterns for givenString 0010010010:

    possiblePatterns = [00, 010, 0010, 00100, 01, 001, 0100, 10, 100]

  3. Repeat each of them to make strings of length >= 10:

    testPattern0 = 0000000000    // 00 00 00 00 00
    testPattern1 = 010010010010  // 010 010 010 010
    testPattern2 = 001000100010  // 0010 0010 0010
    ...

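  4. And check...

    for each testPattern:
        if '0010010010' is a substring of testPattern ==> pattern found


    One of the matching strings:

    testPattern1: 010010010010
    givenString :   0010010010

  5. Patterns found:

    foundPatterns = [010, 001, 100]

  6. As you can see, this list is possibly redundant, since all the patterns are basically the same, just shifted. But depending on the use case, this may actually be what you want.


    Code: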

    function findPatterns(str){
        var len = str.length;
        var possible_patterns = {};  // save as keys to prevent duplicates
        var testPattern;
        var foundPatterns = [];
    
        // 1) create collection of possible patterns where:
        //    1 < possiblePattern.length <= str.length/2
        for(var i = 0; i <= len/2; i++){
            for(var j = i+2; j <= len/2; j++){
                possible_patterns[str.substring(i, j)] = 0;
            }
        }
    
        // 2) create testPattern to test against given str where:
        //    testPattern.length >= str.length + pattern.length, so that a
        //    shifted occurrence of str still fits inside it
        for(var pattern in possible_patterns){
            testPattern = new Array(Math.ceil(len/pattern.length)+2).join(pattern);
            if(testPattern.indexOf(str) >= 0){
                foundPatterns.push(pattern);
            }
        }
        return foundPatterns;
    }
    

    ==> fiddle
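If the shifted duplicates mentioned in step 6 are not wanted, one possible way to collapse them is to map every pattern to a canonical rotation and keep each canonical form once. This is only a sketch; `canonicalRotation` and `dedupeRotations` are hypothetical helpers, not part of the answer's code:

```javascript
// Hypothetical helper: the lexicographically smallest rotation of a pattern.
function canonicalRotation(pattern) {
    var best = pattern;
    for (var i = 1; i < pattern.length; i++) {
        var rotated = pattern.slice(i) + pattern.slice(0, i);
        if (rotated < best) best = rotated;
    }
    return best;
}

// Collapse a list of patterns so that rotations of one another appear once,
// represented by their canonical rotation.
function dedupeRotations(patterns) {
    var seen = new Set();
    var result = [];
    patterns.forEach(function (p) {
        var key = canonicalRotation(p);
        if (!seen.has(key)) {
            seen.add(key);
            result.push(key);
        }
    });
    return result;
}

console.log(dedupeRotations(["010", "001", "100"])); // [ '001' ]
```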

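Answer 3: (score: 1)

  1. Maintaining an array of 2^10 entries won't help, because it won't indicate which strings have repeating patterns.
  2. To have a repeating pattern, the pattern length can only be <= 5.

  3. There can be a pattern of length 1, but a pattern of length 5 will cover it. [STEP EDITED]

  4. If there is a pattern of length 2, there is always a pattern of length 4.

  5. From (1), (2), (3) and (4), only patterns of length 3, 4 and 5 need to be checked.

  6. That means: if the first three digits match the next three digits, continue until the end of the string; otherwise break and go to 7.

  7. Match the first four digits with the next four; if they match, continue until the end of the string; otherwise break and go to 8.

  8. Match the first five digits with the next five; if they match, continue until the end of the string; otherwise break and go to 9.

  9. If none of 6, 7 and 8 succeeds, return failure.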
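The steps above could be sketched roughly as follows. `hasPeriod` and `findPatternBySteps` are illustrative names, and the sketch assumes a trailing partial repetition counts as a match, as in the question's examples:

```javascript
// True when every character equals the one p places earlier,
// i.e. the first p characters repeat until the end of the string.
function hasPeriod(s, p) {
    for (var i = p; i < s.length; i++) {
        if (s.charAt(i) !== s.charAt(i - p)) return false;
    }
    return true;
}

// Try pattern lengths 3, 4 and 5 in that order (steps 6 to 8);
// return the pattern, or false when all three fail (step 9).
function findPatternBySteps(s) {
    var lengths = [3, 4, 5];
    for (var i = 0; i < lengths.length; i++) {
        if (hasPeriod(s, lengths[i])) return s.slice(0, lengths[i]);
    }
    return false;
}

console.log(findPatternBySteps("0010010010")); // 001
console.log(findPatternBySteps("0100101101")); // false
```

Note that a string such as 1111111111 is reported as "111" here, because length 3 is tried first and lengths 1 and 2 are skipped by design.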

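Answer 4: (score: 1)

As far as I can tell, there are 62 candidate patterns for binary strings of length 10 => 2^1 + 2^2 + 2^3 + 2^4 + 2^5. Some of them expand to the same string, so the number of distinct patterned strings is smaller. Here is some code that lists them and matches patterned strings: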

function binComb (n){
  var answer = []
  for (var i=0; i<Math.pow(2,n);i++){
    var str = i.toString(2)
    for (var j=str.length; j<n; j++){
      str = "0" + str
    }
    answer.push(str)
  }
  return answer
}

function allCycles(){
  var answer = {}, cycled = ""
  for (var i=1; i<=5; i++){
    var arr = binComb(i)
    for (var j=0; j<arr.length; j++){
      while(cycled.length < 10){
        cycled += arr[j]
        if (10 - cycled.length < arr[j].length)
          cycled += arr[j].substr(0,10 - cycled.length)
      }
      if (answer[cycled]) 
        answer[cycled].push(arr[j])
      else answer[cycled] = [arr[j]]
      cycled = ""
    }
  }
  return answer
}

function getPattern (str){
  var patterns = allCycles()
  if (patterns[str]) 
    return patterns[str]
  else return "Pattern not found."
}

Output:

console.log(getPattern("0010010010"))
console.log(getPattern("1110111011"))
console.log(getPattern("0100101101"))
console.log(getPattern("1111111111"))

["001"]
["1110"]
Pattern not found.
["1", "11", "111", "1111", "11111"]
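The last line of the output shows that several patterns can expand to the same string. A quick sketch to see how the 62 candidate patterns collapse into distinct 10-character strings (`countPatternedStrings` is a made-up name for illustration):

```javascript
// Expand every binary pattern of length 1..maxPatternLength to `length`
// characters and count both the patterns and the distinct results.
function countPatternedStrings(length, maxPatternLength) {
    var distinct = new Set();
    var candidates = 0;
    for (var len = 1; len <= maxPatternLength; len++) {
        for (var value = 0; value < Math.pow(2, len); value++) {
            var pattern = value.toString(2).padStart(len, "0");
            var expanded = pattern.repeat(Math.ceil(length / len)).slice(0, length);
            distinct.add(expanded);
            candidates++;
        }
    }
    return { candidates: candidates, distinct: distinct.size };
}

console.log(countPatternedStrings(10, 5)); // { candidates: 62, distinct: 52 }
```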

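Answer 5: (score: 0)

This answer uses a Python regular expression compiled with the VERBOSE flag set, which allows multi-line regular expressions with comments and non-significant whitespace. I also use named groups in the regexp.

The regexp was developed with the help of the Kodos tool.

The regexp searches for the longest repeating string, trying repeats of length five, then four, then three characters. (A repeat of two characters is redundant because it is equivalent to the longer repeat of four; similarly, a repeat of one is made redundant by five.)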

import re

rawstr = r"""
(?P<r5> .{5})    (?P=r5) |
(?P<r4>(?P<_42> .{2}) .{2})   (?P=r4)   (?P=_42) |
(?P<r3>(?P<_31> .{1}) .{2})   (?P=r3){2}   (?P=_31) 
"""

matchstr = """\
1001110011
1110111011
0010010010
1010101010
1111111111
0100101101
"""

for matchobj in re.finditer(rawstr, matchstr, re.VERBOSE):
    grp, rep = [(g, r) for g, r in matchobj.groupdict().items()
                   if g[0] != '_' and r is not None][0]
    print('%r is a repetition of %r' % (matchobj.group().strip(), rep))

Gives the output:

'1001110011' is a repetition of '10011'
'1110111011' is a repetition of '1110'
'0010010010' is a repetition of '001'
'1010101010' is a repetition of '1010'
'1111111111' is a repetition of '11111'
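Since the question is tagged javascript, here is a rough port of the same idea to a JavaScript regular expression with named groups. It checks one 10-character string at a time; the names `REPEAT_RE`, `regexPattern`, `h4` and `h3` are illustrative and not taken from the answer:

```javascript
// r5/r4/r3 capture the pattern; h4/h3 capture the head of the pattern
// that is needed to fill the string up to exactly 10 characters.
var REPEAT_RE = /^(?:(?<r5>.{5})\k<r5>|(?<r4>(?<h4>.{2}).{2})\k<r4>\k<h4>|(?<r3>(?<h3>.).{2})\k<r3>{2}\k<h3>)$/;

// Returns the repeated pattern (length 5, 4 or 3), or null when none matches.
function regexPattern(s) {
    var match = REPEAT_RE.exec(s);
    if (!match) return null;
    var groups = match.groups;
    return groups.r5 || groups.r4 || groups.r3;
}

["1001110011", "1110111011", "0010010010", "0100101101"].forEach(function (s) {
    console.log(s, "->", regexPattern(s));
});
```

Named groups require a reasonably recent JavaScript engine; numbered groups and backreferences would work the same way on older ones.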

Answer 6: (score: 0)

In Python (again), but without a regexp:

def is_repeated(text):
    'check if the first part of the string is repeated throughout the string'
    len_text = len(text)
    for rep_len in range(len_text // 2,  0, -1):
        reps = (len_text + rep_len) // rep_len
        if (text[:rep_len] * reps).startswith(text):
            return rep_len  # equivalent to boolean True as will not be zero
    return 0     # equivalent to boolean False

matchstr = """\
1001110011
1110111011
0010010010
1010101010
1111111111
0100101101
"""
for line in matchstr.split():
    print('%r has a repetition length of %i' % (line, is_repeated(line)))

Output:

'1001110011' has a repetition length of 5
'1110111011' has a repetition length of 4
'0010010010' has a repetition length of 3
'1010101010' has a repetition length of 4
'1111111111' has a repetition length of 5
'0100101101' has a repetition length of 0