I have a huge file that I can't afford to load into memory, and I need to find a byte sequence in it.
This is what I use now:
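public static byte[] GetRangeFromStream(ref FileStream fs, long start_index, long count)
{
    byte[] data = new byte[count];
    long prev_pos = fs.Position;
    fs.Position = start_index;
    // Read may return fewer bytes than requested, so loop until the buffer is full or EOF.
    int total = 0;
    while (total < data.Length)
    {
        int n = fs.Read(data, total, data.Length - total);
        if (n == 0)
            break; // end of file
        total += n;
    }
    fs.Position = prev_pos; // restore the caller's position
    return data;
}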
public static long GetSequenceIndexInFileStream(byte[] seq, ref FileStream fs, long index, bool get_beginning = true)
{
    if (index >= fs.Length)
        return -1;
    fs.Position = index;
    // Stop where a full match can no longer fit, so the comparison never runs past the end of the file.
    for (long i = index; i <= fs.Length - seq.Length; i++)
    {
        byte temp_byte = (byte)fs.ReadByte();
        // Compare just the first byte, then compare the whole sequence if needed.
        // IsArraysSame is a helper that is not shown in this post.
        if (temp_byte == seq[0] && IsArraysSame(seq, GetRangeFromStream(ref fs, i, seq.Length)))
            return i;
    }
    return -1;
}
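Answer 0 (score: 2)
The best way to do this is to read the file stream byte by byte, looking only for the first byte of the string you are searching for. When you get a match, you know you may have a hit. Read the next N bytes (where N is the length of the string you are searching for) and do an array comparison between the bytes read from the file and the bytes you are searching for.
Let .NET handle the file read buffering by setting an appropriate FileStream buffer size when you open the file. Don't worry about reading ahead to build your own buffer; it is a waste of time, and .NET does it fine.
This approach means you are not doing an array comparison at every byte, and you don't have to care about matches that are split across buffers.
If the file is really large and you are not I/O bound, you could consider creating several tasks with the .NET Task library, each starting at a different location in the file stream, and correlating the results of the tasks once they have all finished.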
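The first-byte scan described above can be sketched as follows. This is only an illustration, written in Java to match the reference implementation further down the thread and assuming Java 9 or later; the class name `FirstByteScan` and the 64 KB buffer size are made up for the example. In C# the same shape applies to a `FileStream` opened with an explicit buffer size.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of any library.
class FirstByteScan {
    /** Returns the zero-based offset of the first match in the stream, or -1 if there is none. */
    static long indexOf(InputStream raw, byte[] seq) throws IOException {
        if (seq.length == 0) {
            return 0;
        }
        // The buffered wrapper does the read-ahead; 64 KB is an arbitrary choice.
        BufferedInputStream in = new BufferedInputStream(raw, 64 * 1024);
        byte[] tail = new byte[seq.length - 1];
        long pos = 0;
        int b;
        while ((b = in.read()) != -1) {
            if ((byte) b == seq[0]) {
                // Possible hit: remember this spot, read the next N-1 bytes, compare.
                in.mark(tail.length);
                int n = in.readNBytes(tail, 0, tail.length);
                if (n == tail.length
                        && Arrays.equals(seq, 1, seq.length, tail, 0, tail.length)) {
                    return pos;
                }
                in.reset(); // not a match: continue right after the first byte
            }
            pos++;
        }
        return -1;
    }
}
```

The full comparison only runs when the first byte matches, and `mark`/`reset` stands in for saving and restoring the stream position in the question's code.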
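Answer 1 (score: 1)
You may want to look at file mapping. It lets you essentially treat a large file as a memory buffer, so any memory-based API can be used on a disk file. No explicit file I/O is needed.
MSDN:
File mapping is the association of a file's contents with a portion of the virtual address space of a process. The system creates a file mapping object (also known as a section object) to maintain this association. A file view is the portion of virtual address space that a process uses to access the file's contents. File mapping allows the process to use both random input and output (I/O) and sequential I/O. It also allows the process to work efficiently with a large data file, such as a database, without having to map the whole file into memory. Multiple processes can also use memory-mapped files to share data...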
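As a rough illustration of the idea, here is a sketch that maps the file one window at a time and runs a plain comparison loop over each view. It is written in Java (via `FileChannel.map`) to match the other code in this thread; in .NET the equivalent facility is `System.IO.MemoryMappedFiles.MemoryMappedFile`. The class name `MappedSearch` and the `windowSize` parameter are assumptions made for the example. Consecutive windows overlap by `seq.length - 1` bytes so that a match straddling a window boundary is still found.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of any library.
class MappedSearch {
    /** Returns the file offset of the first match, or -1 if there is none. */
    static long indexOf(Path file, byte[] seq, int windowSize) throws IOException {
        if (seq.length == 0 || seq.length > windowSize) {
            throw new IllegalArgumentException("need 0 < seq.length <= windowSize");
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            // Advance so that every possible start offset is covered exactly once.
            long step = (long) windowSize - seq.length + 1;
            for (long base = 0; base + seq.length <= size; base += step) {
                int len = (int) Math.min(windowSize, size - base);
                // Only this window is mapped, never the whole file.
                MappedByteBuffer view = ch.map(FileChannel.MapMode.READ_ONLY, base, len);
                for (int i = 0; i + seq.length <= len; i++) {
                    int j = 0;
                    while (j < seq.length && view.get(i + j) == seq[j]) {
                        j++;
                    }
                    if (j == seq.length) {
                        return base + i;
                    }
                }
            }
            return -1;
        }
    }
}
```

A window of a few hundred megabytes would be a typical starting point; a single Java mapping cannot exceed `Integer.MAX_VALUE` bytes, which is another reason to work in windows.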
Answer 2 (score: 0)
I suggest using a modified version of the Knuth-Morris-Pratt algorithm.

algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)

    define variables:
        an integer, m ← 0 (the beginning of the current match in S)
        an integer, i ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)

    while m + i < length(S) do
        if W[i] = S[m + i] then
            if i = length(W) - 1 then
                return m
            let i ← i + 1
        else
            if T[i] > -1 then
                let m ← m + i - T[i], i ← T[i]
            else
                let m ← m + 1, i ← 0

    (if we reach here, we have searched all of S unsuccessfully)
    return the length of S

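The text string can be streamed in because the KMP algorithm does not backtrack in the text. (This is another improvement over the naive algorithm, which does not naturally support streaming.) When streaming, the amortized time to process an incoming character is Θ(1), but the worst-case time is Θ(min(m, n′)), where n′ is the number of text characters seen so far. Source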
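The pseudocode leaves the table T as "computed elsewhere". A minimal sketch of one way to build it, together with a direct transcription of the search loop, might look like this. It assumes the convention the pseudocode relies on (T[0] = -1, and T[i] is the length of the longest proper prefix of W that is also a suffix of W[0..i-1]); the class and method names are made up for the example.

```java
// Hypothetical helper, not part of any library.
class KmpSketch {
    /** Builds the table T for a non-empty pattern w. */
    static int[] buildTable(byte[] w) {
        int[] t = new int[w.length];
        t[0] = -1;              // t[1], if present, stays 0
        int pos = 2;            // next table entry to fill
        int cnd = 0;            // length of the current candidate prefix
        while (pos < w.length) {
            if (w[pos - 1] == w[cnd]) {
                t[pos++] = ++cnd;   // candidate prefix grows by one
            } else if (cnd > 0) {
                cnd = t[cnd];       // fall back to a shorter candidate
            } else {
                t[pos++] = 0;       // no prefix matches here
            }
        }
        return t;
    }

    /** Follows kmp_search above: returns the match position, or s.length if there is none. */
    static int search(byte[] s, byte[] w) {
        int[] t = buildTable(w);
        int m = 0; // beginning of the current match in s
        int i = 0; // position of the current character in w
        while (m + i < s.length) {
            if (w[i] == s[m + i]) {
                if (i == w.length - 1) {
                    return m;
                }
                i++;
            } else if (t[i] > -1) {
                m = m + i - t[i];
                i = t[i];
            } else {
                m++;
                i = 0;
            }
        }
        return s.length;
    }
}
```

Note that m + i never decreases, which is what makes it possible to feed S in as a stream instead of holding it in an array.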
A reference (Java) implementation can be found here
package com.twitter.elephantbird.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on how the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher {

    protected byte[] pattern_;
    protected int[] borders_;

    // An upper bound on pattern length for searching. Results are undefined for longer patterns.
    public static final int MAX_PATTERN_LENGTH = 1024;

    public StreamSearcher(byte[] pattern) {
        setPattern(pattern);
    }

    /**
     * Sets a new pattern for this StreamSearcher to use.
     * @param pattern
     *          the pattern the StreamSearcher will look for in future calls to search(...)
     */
    public void setPattern(byte[] pattern) {
        pattern_ = Arrays.copyOf(pattern, pattern.length);
        borders_ = new int[pattern_.length + 1];
        preProcess();
    }

    /**
     * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
     * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
     * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
     * another reasonable default, i.e. leave the stream unchanged.
     *
     * @return bytes consumed if found, -1 otherwise.
     * @throws IOException
     */
    public long search(InputStream stream) throws IOException {
        long bytesRead = 0;

        int b;
        int j = 0;

        while ((b = stream.read()) != -1) {
            bytesRead++;

            while (j >= 0 && (byte)b != pattern_[j]) {
                j = borders_[j];
            }
            // Move to the next character in the pattern.
            ++j;

            // If we've matched up to the full pattern length, we found it. Return,
            // which will automatically save our position in the InputStream at the point immediately
            // following the pattern match.
            if (j == pattern_.length) {
                return bytesRead;
            }
        }

        // No dice. Note that the stream is now completely consumed.
        return -1;
    }

    /**
     * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
     * and aids in implementation of the Knuth-Morris-Pratt string search.
     * <p>
     * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
     */
    protected void preProcess() {
        int i = 0;
        int j = -1;
        borders_[i] = j;
        while (i < pattern_.length) {
            while (j >= 0 && pattern_[i] != pattern_[j]) {
                j = borders_[j];
            }
            borders_[++i] = ++j;
        }
    }
}
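Since `search` returns the number of bytes consumed rather than the index the question asks for, the start offset of the match is that count minus the pattern length. The sketch below shows this on a file; so that it compiles on its own it restates the class's logic in condensed form, and the name `FileKmp` and its methods are made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; a condensed restatement of StreamSearcher's logic.
class FileKmp {
    /** Same border table as preProcess(): b[k] is the longest border of the first k pattern bytes. */
    static int[] borders(byte[] p) {
        int[] b = new int[p.length + 1];
        int i = 0;
        int j = -1;
        b[0] = -1;
        while (i < p.length) {
            while (j >= 0 && p[i] != p[j]) {
                j = b[j];
            }
            b[++i] = ++j;
        }
        return b;
    }

    /** Returns the offset of the first match relative to where the stream started, or -1. */
    static long firstOffset(InputStream in, byte[] p) throws IOException {
        if (p.length == 0) {
            return 0;
        }
        int[] b = borders(p);
        long read = 0;
        int c;
        int j = 0;
        while ((c = in.read()) != -1) {
            read++;
            while (j >= 0 && (byte) c != p[j]) {
                j = b[j];
            }
            if (++j == p.length) {
                return read - p.length; // bytes consumed minus pattern length
            }
        }
        return -1;
    }

    /** Opens the file with a buffered stream so the per-byte reads stay cheap. */
    static long firstOffset(Path file, byte[] p) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            return firstOffset(in, p);
        }
    }
}
```

Unlike the code in the question, this never seeks backwards, so it reads each byte of the file at most once.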