Question

最近我参加了一个黑客马拉松，我开始了解一个试图在二维矩阵中找到网格形式的问题。一个模式可能是U，H和T，将由3 *表示3矩阵假设我想要呈现H和U

+--+--+--+            +--+--+--+
|1 |0 |1 |            |1 |0 |1 |
+--+--+--+            +--+--+--+
|1 |1 |1 |  --> H     |1 |0 |1 |    -> U
+--+--+--+            +--+--+--+
|1 |0 |1 |            |1 |1 |1 |
+--+--+--+            +--+--+--+

现在我需要将其搜索到10*10 matrix containing 0s and 1s。最近也是唯一的解决方案我可以得到它的强力算法O（n ^ 4）。在MATLAB和R等语言中有非常微妙的方法可以做到这一点但是不是在C，C ++中。我尝试了很多在Google和SO上搜索这个解决方案。但我最接近的是SO POST讨论实现 Rabin-Karp字符串搜索算法。但是没有伪代码或者任何解释这个的帖子。有人可以帮助或提供任何链接，pdf或一些逻辑来简化这个吗？

修改

作为Eugene Sh。评论说如果N是大矩阵（NxN）的大小而k是小矩阵（kxk），则buteforce算法应该取O（（Nk）^ 2）。由于k是固定的，它减少到O（N ^ 2）。是绝对正确的。但如果N和K很大，有没有任何一般化的方法呢？

Answer 1

好的，这是2D Rabin-Karp方法。

对于以下讨论，假设我们想在（ n ， m ， m ）子矩阵> n 矩阵。（这个概念也适用于矩形矩阵，但我的指数用完了。）

我们的想法是，对于每个可能的子矩阵，我们计算一个哈希值。只有当哈希匹配我们想要找到的矩阵的哈希时，我们才会按元素进行比较。

为了提高效率，我们必须避免每次重新计算子矩阵的整个哈希值。因为今晚我睡不着觉，我唯一可以弄清楚如何做到这一点的哈希函数就是相应子矩阵中1的总和。我将它作为练习留给比我更聪明的人，以找出更好的滚动哈希函数。

现在，如果我们刚刚从（ i ， j ）检查子矩阵到（ i + m < / i> - 1， j + m - 1）并且知道里面有 x 1s，我们可以计算出1的数量子矩阵一到右 - 即从（ i ， j + 1）到（ i + m - 1， j + m ） - 通过从（ i ， j减去子矢量中的1的数量）到（ i + m - 1， j ）并在子载体中添加1的数量（< i> i ， j + m ）到（ i + m - 1， j + m ）。

如果我们点击大矩阵的右边距，我们将窗口向下移动一个然后再向左移动边距然后再向下移动一个然后向右移动，依此类推。

注意，这需要O（ m ）操作，而不是每个候选者的O（ m ²）。如果我们为每对索引执行此操作，我们得到O（ m ²）工作。因此，通过巧妙地将潜在子矩阵的大小的窗口移动通过大矩阵，我们可以将工作量减少因子 m 。也就是说，如果我们没有得到太多的哈希冲突。

这是一张图片：

The rolling hash function for the window shifted to the right.

当我们将当前窗口向右移动时，我们在左侧的红色列向量中减去1的数量，并在右侧的绿色列向量中添加1的数量以获得1的数量在新窗口中。

我使用了很棒的Eigen C ++模板库实现了这个想法的快速演示。该示例还使用了Boost中的一些内容，但仅用于参数解析和输出格式化，因此如果您没有Boost但想要尝试代码，则可以轻松摆脱它。索引摆弄有点乱，但我会在此处不做进一步解释。上述散文应充分涵盖。

#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <random>
#include <type_traits>
#include <utility>

#include <boost/format.hpp>
#include <boost/lexical_cast.hpp>

#include <Eigen/Dense>

#define PROGRAM "submatrix"
#define SEED_CSTDLIB_RAND 1

using BitMatrix = Eigen::Matrix<bool, Eigen::Dynamic, Eigen::Dynamic>;
using Index1D = BitMatrix::Index;
using Index2D = std::pair<Index1D, Index1D>;

std::ostream&
operator<<(std::ostream& out, const Index2D& idx)
{
  out << "(" << idx.first << ", " << idx.second << ")";
  return out;
}

BitMatrix
get_random_bit_matrix(const Index1D rows, const Index1D cols)
{
  auto matrix = BitMatrix {rows, cols};
  matrix.setRandom();
  return matrix;
}

Index2D
findSubMatrix(const BitMatrix& haystack,
              const BitMatrix& needle,
              Index1D *const collisions_ptr = nullptr) noexcept
{
  static_assert(std::is_signed<Index1D>::value, "unsigned index type");
  const auto end = Index2D {haystack.rows(), haystack.cols()};
  const auto hr = haystack.rows();
  const auto hc = haystack.cols();
  const auto nr = needle.rows();
  const auto nc = needle.cols();
  if (nr > hr || nr > hc)
    return end;
  const auto target = needle.count();
  auto current = haystack.block(0, 0, nr - 1, nc).count();
  auto j = Index1D {0};
  for (auto i = Index1D {0}; i <= hr - nr; ++i)
    {
      if (j == 0)  // at left margin
        current += haystack.block(i + nr - 1, 0, 1, nc).count();
      else if (j == hc - nc)  // at right margin
        current += haystack.block(i + nr - 1, hc - nc, 1, nc).count();
      else
        assert(!"this should never happen");
      while (true)
        {
          if (i % 2 == 0)  // moving right
            {
              if (j > 0)
                current += haystack.block(i, j + nc - 1, nr, 1).count();
            }
          else  // moving left
            {
              if (j < hc - nc)
                current += haystack.block(i, j, nr, 1).count();
            }
          assert(haystack.block(i, j, nr, nc).count() == current);
          if (current == target)
            {
              // TODO: There must be a better way than using cwiseEqual().
              if (haystack.block(i, j, nr, nc).cwiseEqual(needle).all())
                return Index2D {i, j};
              else if (collisions_ptr)
                *collisions_ptr += 1;
            }
          if (i % 2 == 0)  // moving right
            {
              if (j < hc - nc)
                {
                  current -= haystack.block(i, j, nr, 1).count();
                  ++j;
                }
              else break;
            }
          else  // moving left
            {
              if (j > 0)
                {
                  current -= haystack.block(i, j + nc - 1, nr, 1).count();
                  --j;
                }
              else break;
            }
        }
      if (i % 2 == 0)  // at right margin
        current -= haystack.block(i, hc - nc, 1, nc).count();
      else  // at left margin
        current -= haystack.block(i, 0, 1, nc).count();
    }
  return end;
}

int
main(int argc, char * * argv)
{
  if (SEED_CSTDLIB_RAND)
    {
      std::random_device rnddev {};
      srand(rnddev());
    }
  if (argc != 5)
    {
      std::cerr << "usage: " << PROGRAM
                << " ROWS_HAYSTACK COLUMNS_HAYSTACK"
                << " ROWS_NEEDLE COLUMNS_NEEDLE"
                << std::endl;
      return EXIT_FAILURE;
    }
  auto hr = boost::lexical_cast<Index1D>(argv[1]);
  auto hc = boost::lexical_cast<Index1D>(argv[2]);
  auto nr = boost::lexical_cast<Index1D>(argv[3]);
  auto nc = boost::lexical_cast<Index1D>(argv[4]);
  const auto haystack = get_random_bit_matrix(hr, hc);
  const auto needle = get_random_bit_matrix(nr, nc);
  auto collisions = Index1D {};
  const auto idx = findSubMatrix(haystack, needle, &collisions);
  const auto end = Index2D {haystack.rows(), haystack.cols()};
  std::cout << "This is the haystack:\n\n" << haystack << "\n\n";
  std::cout << "This is the needle:\n\n" << needle << "\n\n";
  if (idx != end)
    std::cout << "Found as sub-matrix at " << idx << ".\n";
  else
    std::cout << "Not found as sub-matrix.\n";
  std::cout << boost::format("There were %d (%.2f %%) hash collisions.\n")
    % collisions
    % (100.0 * collisions / ((hr - nr) * (hc - nc)));
  return (idx != end) ? EXIT_SUCCESS : EXIT_FAILURE;
}

在编译和运行时，请将上述内容视为伪代码。我几乎没有尝试过优化它。这只是我自己的概念验证。

Answer 2

我将提出一种算法，在最糟糕的情况下，O(n*n)和k = O(sqrt(n))一般会花费O(n*n + n*k*k)时间。这是Aho-Corasick到2D的扩展。回想一下，Aho-Corasick在目标字符串T中找到一组模式的所有出现，并且它在模式长度，T的长度和出现次数中以时间线性进行。

我们来介绍一些术语。干草堆是我们正在搜索的大矩阵，针是图案矩阵。干草堆是nxn矩阵，针是kxk矩阵。我们将在Aho-Corasick中使用的一组模式是针的行集。此集合最多包含k行，如果存在重复行，则会减少。

我们将构建Aho-Corasick自动机（这是一个增加故障链接的Trie），然后在大海捞针的每一行上运行搜索算法。因此，我们采取针的每一行，并在大海捞针的每一行中搜索它。我们可以使用线性时间1D匹配算法来做到这一点，但这仍然是低效的。 Aho-Corasick的优势在于它可以同时搜索所有模式。

在搜索过程中，我们将填充一个矩阵A，稍后我们将使用它。当我们在干草堆的第一行中搜索时，A的第一行填充了干草堆第一行中针的行的出现。因此，我们最终会得到第一行A，例如2 - 0 - - 1。这意味着针的行号0出现在干草堆第一行的位置2;行号1出现在5位置;行号2出现在位置0. -条目是未匹配的位置。继续为每一行做这个。

现在让我们假设针中没有重复的行。将0分配给针的第一行，1分配给第二行，依此类推。现在，我们将使用线性时间1D搜索算法（例如KMP）在矩阵[0 1 2 ... k-1]的每一列中搜索模式A。回想一下，A的每一行都存储针的行出现的位置。因此，如果列包含模式[0 1 2 ... k-1]，则表示针的行号0出现在干草堆的某一行，针的行号1正好在其下方，并且等等。这正是我们想要的。如果有重复的行，只需为每个唯一的行分配一个唯一的编号。

使用线性时间算法在列中搜索O(n)。因此，搜索所有列需要O(n*n)。我们在搜索期间填充矩阵，我们搜索干草堆的每一行（有n行），并且连续搜索需要O(n+k*k)。总的来说O(n(n+k*k))。

所以我的想法是找到矩阵，然后将问题简化为1D模式匹配。 Aho-Corasick只是为了效率，我不知道是否有另一种有效的方法来找到矩阵。

编辑：添加了实施。

这是我的c ++实现。 n的最大值设置为100，但您可以更改它。

程序首先读取两个整数n k（矩阵的维数）。然后它读取n行，每行包含一个0和1的长度为n的字符串。然后它读取k行，每行包含一个0和1的长度为k的字符串。输出是所有匹配的左上角坐标。例如，对于以下输入。

12 2
101110111011
111010111011
110110111011
101110111010
101110111010
101110111010
101110111010
111010111011
111010111011
111010111011
111010111011
111010111011
11
10

程序将输出：

match at (2,0)
match at (1,1)
match at (0,2)
match at (6,2)
match at (2,10)

#include <cstdio>
#include <cstring>
#include <string>
#include <queue>
#include <iostream>

using namespace std;

const int N = 100;
const int M = N;
int n, m;

string haystack[N], needle[M];
int A[N][N]; /* filled by successive calls to match */
int p[N]; /* pattern to search for in columns of A */

struct Node
{   Node *a[2]; /* alphabet is binary */
    Node *suff; /* pointer to node whose prefix = longest proper suffix of this node */
    int flag;

    Node()
    {   a[0] = a[1] = 0;
        suff = 0;
        flag = -1;
    }
};

void insert(Node *x, string s)
{   static int id = 0;
    static int p_size = 0;

    for(int i = 0; i < s.size(); i++)
    {   char c = s[i];
        if(x->a[c - '0'] == 0)
            x->a[c - '0'] = new Node;
        x = x->a[c - '0'];
    }
    if(x->flag == -1)
        x->flag = id++;

    /* update pattern */
    p[p_size++] = x->flag;
}

Node *longest_suffix(Node *x, int c)
{   while(x->a[c] == 0)
        x = x->suff;
    return x->a[c];
}

Node *mk_automaton(void)
{   Node *trie = new Node;
    for(int i = 0; i < m; i++)
    {   insert(trie, needle[i]);
    }

    queue<Node*> q;

    /* level 1 */
    for(int i = 0; i < 2; i++)
    {   if(trie->a[i])
        {   trie->a[i]->suff = trie;
            q.push(trie->a[i]);
        }
        else trie->a[i] = trie;
    }

    /* level > 1 */
    while(q.empty() == false)
    {   Node *x = q.front(); q.pop();
        for(int i = 0; i < 2; i++)
        {   if(x->a[i] == 0) continue;
            x->a[i]->suff = longest_suffix(x->suff, i);
            q.push(x->a[i]);
        }
    }

    return trie;
}

/* search for patterns in haystack[j] */
void match(Node *x, int j)
{   for(int i = 0; i < n; i++)
    {   x = longest_suffix(x, haystack[j][i] - '0');
        if(x->flag != -1)
        {   A[j][i-m+1] = x->flag;
        }
    }
}

int match2d(Node *x)
{   int matches = 0;
    static int z[M+N];
    static int z_str[M+N+1];

    /* init */
    memset(A, -1, sizeof(A));

    /* fill the A matrix */
    for(int i = 0; i < n; i++)
    {   match(x, i);
    }

    /* build string for z algorithm */
    z_str[n+m] = -2; /* acts like `\0` for strings */
    for(int i = 0; i < m; i++)
    {   z_str[i] = p[i];
    }

    for(int i = 0; i < n; i++)
    {   /* search for pattern in column i */
        for(int j = 0; j < n; j++)
        {   z_str[j + m] = A[j][i];
        }

        /* run z algorithm */
        int l, r;
        l = r = 0;
        z[0] = n + m;
        for(int j = 1; j < n + m; j++)
        {   if(j > r)
            {   l = r = j;
                while(z_str[r] == z_str[r - l]) r++;
                z[j] = r - l;
                r--;
            }
            else
            {   if(z[j - l] < r - j + 1)
                {   z[j] = z[j - l];
                }
                else
                {   l = j;
                    while(z_str[r] == z_str[r - l]) r++;
                    z[j] = r - l;
                    r--;
                }
            }
        }

        /* locate matches */
        for(int j = m; j < n + m; j++)
        {   if(z[j] >= m)
            {   printf("match at (%d,%d)\n", j - m, i);
                matches++;
            }
        }
    }

    return matches;
}

int main(void)
{   cin >> n >> m;
    for(int i = 0; i < n; i++)
    {   cin >> haystack[i];
    }
    for(int i = 0; i < m; i++)
    {   cin >> needle[i];
    }

    Node *trie = mk_automaton();
    match2d(trie);

    return 0;
}

Answer 3

让我们从O(N * N * K)解决方案开始。我将使用以下表示法：A是一个模式矩阵，B是一个大矩阵（我们将搜索模式中出现的那个）。

我们可以修复B矩阵的顶行（也就是说，我们将搜索在某个位置开始的所有匹配（此行，任何列）。让我们将此行称为{{ 1}}。现在我们可以得到一个包含topRow行和所有列的矩阵。
让我们根据连接[topRow; topRow + K)创建一个新矩阵，其中A + column + the slice是column中不存在的K元素的列或A（如果B和A由B和0组成，我们可以使用1。现在我们可以将这个新矩阵的列视为字母并运行Knuth-Morris-Pratt的算法。比较两个字母需要-1时间，因此此步骤的时间复杂度为O(K)。

有O(N * K)种方法可以修复第一行，因此总时间复杂度为O(N)。它已经比蛮力解决方案更好了，但我们还没有完成。理论下界是O(N * N * K)（我假设O(N * N)），我想实现它。

让我们来看看这里可以改进的地方。如果我们可以在N >= K时间而不是O(1)中比较矩阵的两列，我们就可以达到所需的时间复杂度。让我们连接O(k)和A的所有列，在每列之后插入一些分隔符。现在我们有一个字符串，我们需要比较它的子字符串（因为列和它们的部分现在是子字符串）。让我们用线性时间构造一个后缀树（使用Ukkonnen的算法）。现在比较两个子串就是要找到这棵树中两个节点的最低共同祖先（LCA）的高度。有一个algorithm允许我们使用线性预处理时间和每个LCA查询B时间来执行此操作。这意味着我们可以在恒定时间内比较两个子串（或列）！因此，总时间复杂度为O(1)。还有另一种方法可以实现这种时间复杂度：我们可以在线性时间内构建后缀数组，并在恒定时间内回答最长的公共前缀查询（使用线性时间预处理）。但是，这两个O(N * N)解决方案看起来很难实现，并且它们将具有很大的常数。

PS如果我们有一个我们可以完全信任的多项式哈希函数（或者我们可以使用一些误报），我们可以使用二维多项式哈希得到一个更简单的O(N * N)解决方案。

找出2d矩阵是否是另一个2d矩阵的子集

3 个答案: