Question

I have a huge file that I can't afford to load into memory, and I need to find a byte sequence inside it.

This is what I'm using now:

public static byte[] GetRangeFromStream(ref FileStream fs, long start_index, long count)
{
    byte[] data = new byte[count];
    long prev_pos = fs.Position;
    fs.Position = start_index;
    fs.Read(data, 0, data.Length);
    fs.Position = prev_pos;
    return data;
}

public static long GetSequenceIndexInFileStream(byte[] seq, ref FileStream fs, long index, bool get_beginning = true)
{
    if (index >= fs.Length)
        return -1;

    fs.Position = index;

    for (long i = index; i < fs.Length; i++)
    {
        byte temp_byte = (byte)fs.ReadByte();

        if (temp_byte == seq[0] && IsArraysSame(seq, GetRangeFromStream(ref fs, i, seq.Length))) //compare just first bytes and then compare seqs if needed
            return i;
    }

    return -1;
}

Answer 1

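The best way to do this is to read the file stream byte by byte, looking only for the first byte of the sequence you are searching for. When you get a match, you know you have a possible hit. Read the next N bytes (where N is the length of the sequence being searched for) and do an array comparison of the bytes read from the file against the bytes being searched for.

Let .NET handle the file read buffering by setting an appropriate FileStream buffer size when you open the file. Don't worry about reading ahead to build your own buffer; that is a waste of time, and .NET does it well.

This approach means you aren't doing an array comparison at every byte, and you don't have to care about matches split across buffers, etc.

If the file is really large and you are not I/O bound, you could consider creating multiple tasks with the .NET Task library, each starting at a different position in the file stream, and correlating the results of each task once they have all completed.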
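The first-byte scan described above can be sketched roughly as follows. This is an illustration in Java rather than C#, and the names `FirstByteScan` and `indexOf` as well as the 64 KiB buffer size are assumptions made for the example, not part of the answer. It relies on `BufferedInputStream.mark`/`reset` to step back after a failed candidate, so the library's buffer does the read-ahead.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

final class FirstByteScan {

    /** Returns the zero-based offset of the first occurrence of pattern, or -1 if absent. */
    static long indexOf(InputStream raw, byte[] pattern) throws IOException {
        if (pattern.length == 0) {
            return 0;
        }
        // The buffer must be able to hold one full candidate so reset() can rewind.
        // 64 KiB is an arbitrary choice for this sketch.
        BufferedInputStream in =
                new BufferedInputStream(raw, Math.max(1 << 16, pattern.length));
        long pos = 0;
        int b;
        while ((b = in.read()) != -1) {
            if ((byte) b == pattern[0]) {
                // Possible hit: remember this spot, then compare the remaining bytes.
                in.mark(pattern.length);
                if (restMatches(in, pattern)) {
                    return pos;
                }
                in.reset(); // not a match, continue right after the first byte
            }
            pos++;
        }
        return -1;
    }

    /** Compares pattern[1..] against the next bytes of the stream. */
    private static boolean restMatches(InputStream in, byte[] pattern) throws IOException {
        for (int i = 1; i < pattern.length; i++) {
            int b = in.read();
            if (b == -1 || (byte) b != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For a file, the stream would typically come from something like `Files.newInputStream(path)`. The worst case is still proportional to file length times pattern length when the first byte is very common in the data.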
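One possible shape for the multi-task idea, sketched in Java with an `ExecutorService` instead of the .NET Task library. The names `ParallelSearch`, `indexOf` and `scan` are hypothetical. Each task owns the match start positions in its own chunk but may read a few bytes past the chunk end, so a match that straddles a chunk boundary is still found. Results are checked in file order so the earliest match wins.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelSearch {

    /** Returns the offset of the first match in the file, or -1 if there is none. */
    static long indexOf(Path file, byte[] pattern, int tasks) throws Exception {
        if (pattern.length == 0) {
            return 0;
        }
        long size = Files.size(file);
        long chunk = Math.max(1, (size + tasks - 1) / tasks);
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (long start = 0; start < size; start += chunk) {
                final long from = start;
                final long to = Math.min(size, start + chunk);
                Callable<Long> job = () -> scan(file, pattern, from, to);
                results.add(pool.submit(job));
            }
            // Futures are in file order, so the first hit is the earliest match.
            for (Future<Long> f : results) {
                long hit = f.get();
                if (hit >= 0) {
                    return hit;
                }
            }
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }

    /** Scans for matches that START in [from, to); may read past 'to' to finish a candidate. */
    private static long scan(Path file, byte[] pattern, long from, long to) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ch.position(from);
            BufferedInputStream in = new BufferedInputStream(
                    Channels.newInputStream(ch), Math.max(1 << 16, pattern.length));
            for (long pos = from; pos < to; pos++) {
                int b = in.read();
                if (b == -1) {
                    break;
                }
                if ((byte) b == pattern[0]) {
                    in.mark(pattern.length);
                    boolean ok = true;
                    for (int i = 1; ok && i < pattern.length; i++) {
                        ok = in.read() == (pattern[i] & 0xFF);
                    }
                    if (ok) {
                        return pos;
                    }
                    in.reset();
                }
            }
            return -1;
        }
    }
}
```

Whether this helps depends on the storage: on a single spinning disk, several readers at different offsets may be slower than one sequential pass, which is why the answer restricts the idea to cases that are not I/O bound.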

Answer 2

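You may want to look at file mapping. It lets you essentially treat a large file as a memory buffer, so any memory-based API can be used on a disk file. No explicit file I/O is needed.

MSDN:

File mapping is the association of a file's contents with a portion of the virtual address space of a process. The system creates a file mapping object (also known as a section object) to maintain this association. A file view is the portion of virtual address space that a process uses to access the file's contents. File mapping allows the process to use both random input and output (I/O) and sequential I/O. It also allows the process to work efficiently with a large data file, such as a database, without having to map the whole file into memory. Multiple processes can also use memory-mapped files to share data...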
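To illustrate the idea, here is a minimal sketch using Java's `FileChannel.map`, which is a Java analogue of the file mapping described above rather than the Windows or .NET API itself. The class name `MappedSearch`, the method `indexOf` and the `windowSize` parameter are assumptions made for the example. Because a single Java mapping cannot exceed `Integer.MAX_VALUE` bytes, the file is mapped in windows, and each window is extended by `pattern.length - 1` bytes so matches that cross a window boundary are not missed.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedSearch {

    /**
     * Returns the offset of the first match, or -1 if there is none.
     * Assumes windowSize + pattern.length - 1 fits in an int.
     */
    static long indexOf(Path file, byte[] pattern, int windowSize) throws IOException {
        if (pattern.length == 0) {
            return 0;
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long base = 0; base + pattern.length <= size; base += windowSize) {
                // Overlap the next window by pattern.length - 1 bytes.
                long len = Math.min((long) windowSize + pattern.length - 1, size - base);
                MappedByteBuffer view = ch.map(FileChannel.MapMode.READ_ONLY, base, len);
                int last = (int) len - pattern.length;
                for (int i = 0; i <= last; i++) {
                    if (matchesAt(view, i, pattern)) {
                        return base + i;
                    }
                }
            }
        }
        return -1;
    }

    /** Naive comparison of the pattern against the view at the given offset. */
    private static boolean matchesAt(MappedByteBuffer view, int offset, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++) {
            if (view.get(offset + j) != pattern[j]) {
                return false;
            }
        }
        return true;
    }
}
```

The inner comparison is deliberately naive; the point of the sketch is only that the mapped view is addressed like an array while the operating system pages the file in on demand.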

Answer 3

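I'd suggest using a modified version of the Knuth-Morris-Pratt algorithm.

algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)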

define variables:
    an integer, m ← 0 (the beginning of the current match in S)
    an integer, i ← 0 (the position of the current character in W)
    an array of integers, T (the table, computed elsewhere)

while m + i < length(S) do
    if W[i] = S[m + i] then
        if i = length(W) - 1 then
            return m
        let i ← i + 1
    else
        if T[i] > -1 then
            let m ← m + i - T[i], i ← T[i]
        else
            let m ← m + 1, i ← 0

(if we reach here, we have searched all of S unsuccessfully)
return the length of S
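The pseudocode leaves the table T "computed elsewhere". As a rough sketch, the Java below transcribes the search loop and adds one common way to build T, using the same convention as the pseudocode (T[0] = -1). The names `KmpArraySearch`, `buildTable` and `search` are illustrative only, and the code works on an in-memory array for clarity instead of a stream.

```java
final class KmpArraySearch {

    /** Builds the partial-match table T for the word w, with T[0] = -1. */
    static int[] buildTable(byte[] w) {
        int[] t = new int[Math.max(w.length, 2)];
        t[0] = -1;
        t[1] = 0;
        int pos = 2; // index of T currently being computed
        int cnd = 0; // length of the current candidate prefix
        while (pos < w.length) {
            if (w[pos - 1] == w[cnd]) {
                t[pos++] = ++cnd;   // the candidate prefix grows by one
            } else if (cnd > 0) {
                cnd = t[cnd];       // fall back to a shorter candidate
            } else {
                t[pos++] = 0;       // no proper prefix matches here
            }
        }
        return t;
    }

    /** Returns the zero-based position of w in s, or s.length if not found (as in the pseudocode). */
    static int search(byte[] s, byte[] w) {
        if (w.length == 0) {
            return 0;
        }
        int[] t = buildTable(w);
        int m = 0; // beginning of the current match in s
        int i = 0; // position of the current character in w
        while (m + i < s.length) {
            if (w[i] == s[m + i]) {
                if (i == w.length - 1) {
                    return m;
                }
                i++;
            } else if (t[i] > -1) {
                m = m + i - t[i];
                i = t[i];
            } else {
                m++;
                i = 0;
            }
        }
        return s.length;
    }
}
```

For the word "ABCDABD" this table comes out as -1, 0, 0, 0, 0, 1, 2. Note that `m + i` never decreases, which is the property that makes a streaming variant possible.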

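The text string can be streamed in because the KMP algorithm does not backtrack in the text. (This is another improvement over the naive algorithm, which does not naturally support streaming.) When streaming, the amortized time to process an incoming character is Θ(1), but the worst-case time is Θ(min(m, n')), where n' is the number of text characters seen so far. Source

A reference (Java) implementation can be found here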

package com.twitter.elephantbird.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on how the algorithm works, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher {

  protected byte[] pattern_;
  protected int[] borders_;

  // An upper bound on pattern length for searching. Results are undefined for longer patterns.
  public static final int MAX_PATTERN_LENGTH = 1024;

  public StreamSearcher(byte[] pattern) {
    setPattern(pattern);
  }

  /**
   * Sets a new pattern for this StreamSearcher to use.
   * @param pattern
   *          the pattern the StreamSearcher will look for in future calls to search(...)
   */
  public void setPattern(byte[] pattern) {
    pattern_ = Arrays.copyOf(pattern, pattern.length);
    borders_ = new int[pattern_.length + 1];
    preProcess();
  }

  /**
   * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
   * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
   * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
   * another reasonable default, i.e. leave the stream unchanged.
   *
   * @return bytes consumed if found, -1 otherwise.
   * @throws IOException
   */
  public long search(InputStream stream) throws IOException {
    long bytesRead = 0;

    int b;
    int j = 0;

    while ((b = stream.read()) != -1) {
      bytesRead++;

      while (j >= 0 && (byte)b != pattern_[j]) {
        j = borders_[j];
      }
      // Move to the next character in the pattern.
      ++j;

      // If we've matched up to the full pattern length, we found it.  Return,
      // which will automatically save our position in the InputStream at the point immediately
      // following the pattern match.
      if (j == pattern_.length) {
        return bytesRead;
      }
    }

    // No dice. Note that the stream is now completely consumed.
    return -1;
  }

  /**
   * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
   * and aids in implementation of the Knuth-Morris-Pratt string search.
   * <p>
   * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
   */
  protected void preProcess() {
    int i = 0;
    int j = -1;
    borders_[i] = j;
    while (i < pattern_.length) {
      while (j >= 0 && pattern_[i] != pattern_[j]) {
        j = borders_[j];
      }
      borders_[++i] = ++j;
    }
  }
}

Similar question: Efficient way to search a stream for a string

Is there a faster way to search a large file without loading it into memory?
