Question

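Does anyone know an optimized way to detect a 37-bit sequence in a chunk of binary data? Sure, I can do a brute-force comparison with a sliding window (compare the 37 bits starting at index 0, then increment and loop until I find it), but is there a better way? Maybe some hash-based search that returns the probability that the sequence lies within a binary chunk? Or am I just pulling that out of thin air? Anyway, I'm going ahead with the brute-force search, but I was curious whether there is something more optimal. This is in C, by the way.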
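For reference, the brute-force window described above might look roughly like the sketch below. It is only a baseline under a few assumptions that are not in the question: bits are numbered MSB-first within each byte, the 37-bit pattern is passed in the low bits of a `uint64_t`, and the function name `find_bits_naive` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define PATTERN_BITS 37

/* Returns the bit offset (0 = most significant bit of data[0]) of the first
   occurrence of the low 37 bits of `pattern`, or -1 if it never occurs.
   Bit order and the function name are assumptions of this sketch. */
long find_bits_naive(const unsigned char *data, size_t nbytes, uint64_t pattern)
{
    const uint64_t mask = (UINT64_C(1) << PATTERN_BITS) - 1;
    uint64_t window = 0;
    size_t total_bits = nbytes * 8;

    pattern &= mask;
    for (size_t i = 0; i < total_bits; i++) {
        /* shift the next input bit into a 37-bit window */
        unsigned bit = (data[i / 8] >> (7 - i % 8)) & 1u;
        window = ((window << 1) | bit) & mask;

        /* compare only once 37 bits have been shifted in */
        if (i + 1 >= (size_t)PATTERN_BITS && window == pattern)
            return (long)(i + 1 - PATTERN_BITS);
    }
    return -1;
}
```

Keeping the window in one 64-bit register means each step costs a shift, a mask and one compare, so even the naive approach is linear in the number of input bits.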

Answer 1

You can treat the bits as characters from the alphabet {0,1} and run any of several relatively efficient, well-known substring-search algorithms on the data.
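As one illustration of this idea, here is a minimal sketch of Knuth-Morris-Pratt run over individual bits. KMP is just one of the possible algorithms; the function names, the MSB-first bit order and the 64-bit cap on the pattern length are assumptions of this sketch, not something the answer specifies.

```c
#include <stddef.h>

#define MAX_PAT_BITS 64   /* arbitrary cap chosen for this sketch */

/* Read bit i of buf, counting from the most significant bit of buf[0]. */
static int get_bit(const unsigned char *buf, size_t i)
{
    return (buf[i / 8] >> (7 - i % 8)) & 1;
}

/* KMP over the {0,1} alphabet. Returns the bit offset of the first match of
   the first pat_bits bits of pat inside the first text_bits bits of text,
   or -1 if there is none. */
long kmp_find_bits(const unsigned char *text, size_t text_bits,
                   const unsigned char *pat, size_t pat_bits)
{
    size_t fail[MAX_PAT_BITS];
    size_t k;

    if (pat_bits == 0 || pat_bits > MAX_PAT_BITS)
        return -1;

    /* fail[i] = length of the longest proper border of pat[0..i] */
    fail[0] = 0;
    k = 0;
    for (size_t i = 1; i < pat_bits; i++) {
        while (k > 0 && get_bit(pat, i) != get_bit(pat, k))
            k = fail[k - 1];
        if (get_bit(pat, i) == get_bit(pat, k))
            k++;
        fail[i] = k;
    }

    /* scan the text once; k is the number of pattern bits matched so far */
    k = 0;
    for (size_t i = 0; i < text_bits; i++) {
        while (k > 0 && get_bit(text, i) != get_bit(pat, k))
            k = fail[k - 1];
        if (get_bit(text, i) == get_bit(pat, k))
            k++;
        if (k == pat_bits)
            return (long)(i + 1 - pat_bits);
    }
    return -1;
}
```

For example, searching for the four bits 1011 (passed as the byte 0xB0 with `pat_bits` set to 4) in the bytes 0x15 0x80 reports bit offset 5.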

Answer 2

Interesting question. I assume your 37-bit sequence can begin at any point within a byte. Let's say your sequence is represented like this:

ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@

If we read the data at byte-aligned addresses, we will see one of these 32-bit sequences:

BCDEFGHIJKLMNOPQRSTUVWXYZ0123456 [call this pattern w_A]
CDEFGHIJKLMNOPQRSTUVWXYZ01234567 [w_B, etc.]
DEFGHIJKLMNOPQRSTUVWXYZ012345678
EFGHIJKLMNOPQRSTUVWXYZ0123456789
FGHIJKLMNOPQRSTUVWXYZ0123456789@
GHIJKLMNOPQRSTUVWXYZ0123456789@x
HIJKLMNOPQRSTUVWXYZ0123456789@xx
IJKLMNOPQRSTUVWXYZ0123456789@xxx

Only these values - and no others - can form the second through fifth bytes of a byte sequence that contains the 37 bits of interest.

This leads to a fairly obvious implementation:

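unsigned char *p = ...; // input data
size_t n = ...;  // bytes available
size_t bitpos;

// w_A, w_B, ... are the 32-bit words listed above. HEAD_x and TAIL_x are
// placeholder names for the leftover pattern bits, already shifted into
// place (e.g. TAIL_A is 789@ in the top four bits of a byte, HEAD_A is A
// in the lowest bit).

--n; p++;  // skip the first byte so that p[-1] is valid
bitpos = 0;

while (n--) {
  uint32_t word = *(uint32_t*)p; // nonportable, sorry.
  bitpos += 8; // compiler should be able to optimise this variable out completely

  if (word == w_A) {
    // == binds tighter than &, so each mask needs its own parentheses
    if (((p[4] & 0xF0) == TAIL_A) && ((p[-1] & 1) == HEAD_A)) {
      // we found the data starting at the 8th bit of p-1
      found_at(bitpos-1);
    }
  } else if (word == w_B) {
    if (((p[4] & 0xE0) == TAIL_B) && ((p[-1] & 3) == HEAD_B)) {
      // we found the data starting at the 7th bit of p-1
      found_at(bitpos-2);
    }
  } else if (word == w_C) {
     ...
  }
...
  ++p; // move on to the next byte
}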

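Obviously there are problems with this strategy. First, it probably wants to evaluate p[-1] the first time around the loop, but that is easy to fix. Second, it fetches a word from odd addresses; that does not work on some CPUs - SPARC and 68k, for example. But doing so is an easy way to roll four comparisons into one.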
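If the unaligned fetch is a concern, one common workaround is to assemble the word from individual bytes, or to copy it with `memcpy`. The helpers below are a sketch with made-up names (`load_be32`, `load_native32`); they are not part of the code above. The byte-by-byte version also fixes the byte order, so a constant like w_A can be written in the same left-to-right order as the pattern regardless of the CPU.

```c
#include <stdint.h>
#include <string.h>

/* Assemble four bytes starting at p into a 32-bit value, first byte in the
   most significant position. Works at any alignment and on any byte order. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Copy four bytes starting at p into a word in the CPU's native byte order.
   memcpy avoids the misaligned pointer cast; compilers typically turn it
   into a single load where the hardware allows that. */
uint32_t load_native32(const unsigned char *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}
```

With the native-order variant, the comparison constants have to be built in the same byte order as the machine doing the comparison.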

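kek444's suggestion would let you use an algorithm like KMP to skip forward in the data stream. However, the maximum size of a skip is not large, so while the Turbo Boyer-Moore algorithm might reduce the number of byte comparisons by a factor of 4 or so, that may not be much of a win if the cost of a byte comparison is similar to the cost of a word comparison.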

Answer 3

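If you analyze the first N bits of the pattern, it should not be hard to work out, from the M bits already examined, which bit to resume the search from: those M bits certainly cannot be the start of a match and can be skipped (provided the pattern is such that this can be determined).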

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
<--            N bits           -->
<--   'ugly' M bits    -->|<-- continue here

This should shorten the search a bit.

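Of course, one of the most efficient approaches is to parse the input with a DFA-like state machine, but that seems like overkill. It depends on your usage scenario.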

Answer 4

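If the pattern you are looking for is fixed, you can build a set of arrays holding the pattern and its masks at each bit shift and compare against those. To do the comparison, use xor: if the result is 0, it matches; any other value is not a match. This lets you step through the data a byte at a time as long as at least 2 bytes remain after the current one. With fewer than that left, you can no longer advance a full 8 bits. Below is an example for 17 bits, but the idea is the same. (I am searching for all ones because that makes the bit shifting easy to follow in a demonstration.)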

#include <stdbool.h>

/* Data is passed in; on a match, *offset receives the number of bits by
   which the pattern is offset from the first bit of the first byte.
   Returns true if a match was found.
*/
bool checkData(const unsigned char *data, int *offset)
{
    /* Masks off the leading bits that are not being examined */
    static const unsigned char firstMask[8] = { 0xFF, 0x7F, 0x3F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };

    /* Masks off the trailing bits that are not being examined */
    static const unsigned char endMask[8] = { 0x80, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE, 0xFF };

    /* Pattern being searched for. Each row is the pattern shifted by the
       row index, and the columns hold the bytes to be compared. For example,
       index 0 is a shift of 0 bits and index 7 is a shift of seven bits.
       NOTE: Bits not being used are set to zero.
    */
    static const unsigned char pattern[8][3] = { { 0xFF, 0xFF, 0x80 },  /* Original pattern */
                                                 { 0x7F, 0xFF, 0xC0 },  /* Shifted by one */
                                                 { 0x3F, 0xFF, 0xE0 },  /* Shifted by two */
                                                 { 0x1F, 0xFF, 0xF0 },
                                                 { 0x0F, 0xFF, 0xF8 },
                                                 { 0x07, 0xFF, 0xFC },
                                                 { 0x03, 0xFF, 0xFE },
                                                 { 0x01, 0xFF, 0xFF }}; /* Shifted by seven */

    /* outer loop control variable */
    int lcv;

    /* inner loop control variable */
    int lcv2;

    /* holds the result of each xor comparison */
    unsigned char value;

    /* if there is no match, pass back a negative number to indicate no match */
    *offset = -1;

    /* Loop through the shifted patterns looking for a match */
    for ( lcv = 0; lcv < 8 ; lcv++ ) 
    {
        /* Check the first byte of the pattern: mask off the bits that are
           not to be checked and xor the rest with the first pattern byte */

        value = (firstMask[lcv] & *data) ^ pattern[lcv][0];
        /* if value is not zero, no match, so go to the next shift */
        if ( 0 != value ) 
        {
            continue;
        }

        /* Loop through the middle of the pattern and make sure it matches;
           if it does not, break out of the loop.
           NOTE:  Adjust the condition to be 1 less than the number 
                  of 8-bit items you are comparing
        */
        for ( lcv2 = 1; lcv2 < 2; lcv2++)
        {
            if ( 0 != (*(data+lcv2)^pattern[lcv][lcv2]))
            {
                break;
            }
        }

        /* If the end of the inner loop was not reached, the pattern 
           does not match, so continue to the next shift.
           NOTE: See note above about the condition 
        */   
        if ( 2 != lcv2)
        {
          continue;
        }

        /* Check the end of the pattern to see if there is a match after masking
           off the bits which are not being checked.
        */  
        value = (*(data + lcv2) & endMask[lcv]) ^ pattern[lcv][lcv2];

        /* If value is zero, every part matched: leave the loop now so that
           lcv keeps the matching shift. Otherwise try the next shift. */
        if ( 0 == value ) 
        {
          break;
        }
    }
    /* If the end of the loop was not reached, set the offset as it 
       is the number of bits the pattern is offset in the byte and 
       return true
    */
    if ( lcv < 8 ) 
    {
        *offset = lcv ;
        return true;
    }
    /* No match was found */
    return false;
}

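This requires the user to supply a pointer to the data and to call the function again for each following byte. The user needs to make sure the pattern match does not run past the end of the data.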

When there is no match early in the pattern, it does not go on to check the rest of the bits, which should help the search time.

This implementation should be fairly portable, but it will need some rework for 37 bits.
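One possible way to do that rework, sketched rather than tested against real data, is to pull six bytes into a 64-bit integer and compare against eight precomputed mask/value pairs, one per shift, using the same xor test. The name `find_bits_bytewise`, the MSB-first bit numbering and the choice to pass the pattern in the low 37 bits of a `uint64_t` are all assumptions of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define PATTERN_BITS 37
#define WINDOW_BYTES 6   /* 48 bits: room for 37 bits at any of the 8 shifts */

/* Returns the bit offset of the first match, or -1 if there is none. */
long find_bits_bytewise(const unsigned char *data, size_t nbytes, uint64_t pattern)
{
    const uint64_t ones = (UINT64_C(1) << PATTERN_BITS) - 1;
    uint64_t mask[8], want[8];
    size_t total_bits = nbytes * 8;

    /* Precompute the pattern and its mask at each of the 8 bit shifts. */
    pattern &= ones;
    for (int s = 0; s < 8; s++) {
        int shift = WINDOW_BYTES * 8 - PATTERN_BITS - s;   /* 11 - s */
        mask[s] = ones << shift;
        want[s] = pattern << shift;
    }

    for (size_t pos = 0; pos < nbytes; pos++) {
        /* Build a 48-bit window; bytes past the end read as zero padding. */
        uint64_t window = 0;
        for (size_t j = 0; j < WINDOW_BYTES; j++) {
            unsigned char b = (pos + j < nbytes) ? data[pos + j] : 0;
            window = (window << 8) | b;
        }
        for (int s = 0; s < 8; s++) {
            /* xor is zero on the masked bits exactly when they match;
               the second test rejects matches that rely on the padding */
            if (((window ^ want[s]) & mask[s]) == 0 &&
                pos * 8 + (size_t)s + PATTERN_BITS <= total_bits)
                return (long)(pos * 8 + (size_t)s);
        }
    }
    return -1;
}
```

Unlike checkData, this version walks the whole buffer itself and handles the end of the data by padding, so the caller does not have to guard against overruns.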

Answer 5

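Given any byte B, you want to ask what positions it could possibly occupy in the 37-bit sequence (if any). Then:

1. You keep a set of possible positions for the current byte, which starts out empty.
2. If you see a byte that could be at position 0, add 0 to the set.
3. If you see a byte that could be at position 1..7, mask and check the previous byte, and if it matches, add the current position to the set.
4. To move to a new byte, examine each position in the set, add 8 to it, and ask whether the new byte can appear at that position. When you reach a position of at least 29, you have won an all-expenses-paid trip to a successful search.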

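The exact data structure to use is open to experiment, but you can go quickly with table lookup. Since you have 256 byte values and 8 initial positions, you can store the initial positions in a 256-byte array, hoping that the common case of all zeros is frequent. This should make steps 2 and 3 cost O(1) or O(8), a small constant in either case.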

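For the later position checks, I think you want to index by position rather than by byte, so you need a 29-byte array (one entry for each position from 8..36). This check is O(1) times the number of positions currently live.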
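To make the first table concrete, here is a rough sketch of how the 256-entry array of initial positions might be built. It covers only the lookup used in steps 2 and 3, not the set of live positions or the later-position array. The function name, the one-bit-per-position encoding and the convention that pattern bit 0 is the most significant of the 37 bits are assumptions of this sketch.

```c
#include <stdint.h>
#include <string.h>

#define PATTERN_BITS 37

/* Fills table[b] for every byte value b. Bit p of table[b] (p = 0..7) is
   set when b equals pattern bits p..p+7, i.e. when b could be the first
   full byte of a match whose first p bits ended the previous byte.
   A table entry of 0 means the byte cannot start or continue a fresh match. */
void build_initial_positions(uint64_t pattern, unsigned char table[256])
{
    memset(table, 0, 256);
    for (int p = 0; p < 8; p++) {
        /* extract the 8 pattern bits starting at position p, MSB-first */
        unsigned b = (unsigned)((pattern >> (PATTERN_BITS - 8 - p)) & 0xFF);
        table[b] |= (unsigned char)(1u << p);
    }
}
```

In the scanning loop, a nonzero `table[byte]` would then trigger the previous-byte check from step 3 for each set bit other than bit 0.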

This looks like fun; let us know how you make out.

Answer 6

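Here is one approach: finding your single target bit string can be reduced to finding any one of a set of byte strings. There are fast methods for that problem, such as Aho-Corasick, which searches in a single pass that never backs up, in time independent of the size of the target set.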

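(The set of byte strings consists of each of the 8 shifts of the bit string, filled out with every possible combination of padding bits needed in the first and last bytes. I think there are 1024 of them.)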

Binary sequence detector
