Question

I have binary data that contains text. The text is known in advance. What is a fast way to search for that text?

For example:

This is text 1---
!@##$%%#^%&!%^$! <= Assume this line is 3 MB of binary data
Now, This is text 2 ---
!@##$%%#^%&!%^$! <= Assume this line is 2.5 MB of binary data
This is text 3 ---

How do I search for the text This is text 2?

Currently I am doing it like this:

size_t count = 0;
size_t s_len = strlen("This is text 2");

/* Assume data_len is the length of the data to be searched and data is a pointer (char *) to its start. */
/* Stop while a full match still fits, so memcmp never reads past the end of data. */
for (; count + s_len <= data_len; ++count)
{
    if (!memcmp("This is text 2", data + count, s_len))
    {
        printf("%s\n", "Hurray found you...");
    }
}

Is there any other, more efficient way to do this?
Would replacing the ++count logic with memchr('T') logic help? <= If this statement is unclear, please ignore it.

Answer 1

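There are algorithms for this with better complexity than repeated memcmp (which is implemented in the obvious way and has the obvious complexity for that approach: in the worst case, roughly data_len times s_len byte comparisons).

The well-known ones are Boyer-Moore and Knuth-Morris-Pratt. These are just two examples. The general category they fall into is "string matching".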
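To make the idea concrete, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore (not the full algorithm). It works on raw bytes, so embedded NUL bytes in the data are fine. The function name find_bytes_bmh is made up for this illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search over raw bytes (simplified Boyer-Moore).
 * Returns a pointer to the first match, or NULL if there is none.
 * find_bytes_bmh is a hypothetical name used only for this sketch. */
const unsigned char *find_bytes_bmh(const unsigned char *hay, size_t hay_len,
                                    const unsigned char *needle, size_t needle_len)
{
    size_t shift[UCHAR_MAX + 1];
    size_t i, pos;

    if (needle_len == 0)
        return hay;
    if (needle_len > hay_len)
        return NULL;

    /* A byte that does not occur in the needle allows a jump of a full needle length. */
    for (i = 0; i <= UCHAR_MAX; ++i)
        shift[i] = needle_len;

    /* For bytes that do occur (ignoring the last position), jump so that their
     * rightmost occurrence lines up with the byte at the end of the window. */
    for (i = 0; i + 1 < needle_len; ++i)
        shift[needle[i]] = needle_len - 1 - i;

    pos = 0;
    while (pos <= hay_len - needle_len) {
        if (memcmp(hay + pos, needle, needle_len) == 0)
            return hay + pos;
        /* Decide the jump from the byte under the last position of the window. */
        pos += shift[hay[pos + needle_len - 1]];
    }
    return NULL;
}
```

With the data from the question, a call would look like find_bytes_bmh((const unsigned char *)data, data_len, (const unsigned char *)"This is text 2", 14). On mostly binary data the jumps tend to be close to the pattern length, so far fewer positions get compared than in the byte-by-byte loop; the worst case is still quadratic.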

Answer 2

There is nothing in standard C to help you, but there is a GNU extension, memmem(), that does this:

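#define _GNU_SOURCE   /* must come before any #include so that <string.h> declares memmem() */
#include <string.h>

#define TEXT2 "This is text 2"

/* sizeof(TEXT2) counts the terminating '\0'; subtract 1 to match only the text itself. */
char *pos = memmem(data, data_len, TEXT2, sizeof(TEXT2) - 1);

if (pos != NULL) {
    /* Found it. */
}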

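If you need to be portable to systems that lack this function, you can take the glibc implementation of memmem() and incorporate it into your program.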
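If pulling in the glibc source is not an option, a small fallback can be written with standard C only. The sketch below is not the glibc implementation; my_memmem is a hypothetical name, and the approach (memchr to find the first byte, memcmp to confirm) is the memchr idea raised in the question.

```c
#include <stddef.h>
#include <string.h>

/* Portable stand-in for memmem() using only standard C library calls.
 * my_memmem is a hypothetical name; this is not the glibc implementation. */
void *my_memmem(const void *hay, size_t hay_len,
                const void *needle, size_t needle_len)
{
    const unsigned char *h = hay;
    const unsigned char *n = needle;
    const unsigned char *end;

    if (needle_len == 0)
        return (void *)h;
    if (needle_len > hay_len)
        return NULL;

    /* One past the last position at which a full needle still fits. */
    end = h + (hay_len - needle_len) + 1;

    while (h < end) {
        /* Let memchr jump to the next candidate for the first byte. */
        h = memchr(h, n[0], (size_t)(end - h));
        if (h == NULL)
            return NULL;
        if (memcmp(h, n, needle_len) == 0)
            return (void *)h;
        ++h;
    }
    return NULL;
}
```

How much this helps depends on the C library and the data: memchr is often heavily optimized, so it pays off when the first byte of the pattern is rare in the binary parts, and much less when that byte is common.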

Answer 3

I know the question is about the C programming language, but have you tried the Unix strings tool (http://en.wikipedia.org/wiki/Strings_(Unix)) together with grep?

$ strings datafile | grep "your text"

Edit:

If you want to use C, I suggest this simple optimization:

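size_t count = 0;
size_t s_len = strlen("This is text 2");

/* Stop while a full match still fits, so memcmp never reads past the end of data. */
for (; count + s_len <= data_len; ++count)
{
    /* Skip non-printable bytes; the cast avoids undefined behavior in isprint() for negative char values. */
    if (!isprint((unsigned char)data[count])) continue;

    if (!memcmp("This is text 2", data + count, s_len))
    {
        printf("%s\n", "Hurray found you...");
    }
}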

If you want better performance, I suggest you look up and use a string matching algorithm.
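As one example of such an algorithm, here is a minimal sketch of Knuth-Morris-Pratt over raw bytes. It never moves backwards in the data, so the search is linear in data_len plus the pattern length. The name find_bytes_kmp and the choice to return an offset (with (size_t)-1 meaning "not found") are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>

/* Knuth-Morris-Pratt search over raw bytes.
 * Returns the offset of the first match, or (size_t)-1 if there is no match
 * or the failure table cannot be allocated.
 * find_bytes_kmp is a hypothetical name used only for this sketch. */
size_t find_bytes_kmp(const unsigned char *hay, size_t hay_len,
                      const unsigned char *needle, size_t needle_len)
{
    size_t *fail;
    size_t i, k, result = (size_t)-1;

    if (needle_len == 0)
        return 0;
    if (needle_len > hay_len)
        return result;

    fail = malloc(needle_len * sizeof *fail);
    if (fail == NULL)
        return result;

    /* fail[i] = length of the longest proper prefix of needle[0..i]
     * that is also a suffix of needle[0..i]. */
    fail[0] = 0;
    for (i = 1, k = 0; i < needle_len; ++i) {
        while (k > 0 && needle[i] != needle[k])
            k = fail[k - 1];
        if (needle[i] == needle[k])
            ++k;
        fail[i] = k;
    }

    /* Scan the data once; on a mismatch, fall back within the needle
     * instead of re-reading bytes of the data. */
    for (i = 0, k = 0; i < hay_len; ++i) {
        while (k > 0 && hay[i] != needle[k])
            k = fail[k - 1];
        if (hay[i] == needle[k])
            ++k;
        if (k == needle_len) {
            result = i + 1 - needle_len;
            break;
        }
    }

    free(fail);
    return result;
}
```

For a short pattern in a few megabytes of data, a skip-based search or a library memmem() may well be faster in practice; measuring on real data is the only way to know.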

Search for text in binary data
