Read a large text file up to a certain string

Time: 2014-05-13 07:19:49

Tags: .net string file large-files

I have a large text file whose records are separated by a string (not by a single character), like this:

first data[STRING-SEPERATOR]second data[STRING-SEPERATOR] ......

I don't want to load the whole file into memory because of its size (~250 MB). If I read the whole file with System.IO.File.ReadAllText, I get an OutOfMemoryException.

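So I want to read the file up to the first occurrence of [STRING-SEPERATOR] and then move on to the next string. It is like "taking" the first data off the file, processing it, and then continuing with the second data, which is now the first data in the file.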

System.IO.StreamReader.ReadLine() doesn't help me, because the file's content is a single line.

Do you have any idea how to read a file up to a certain string in .NET?

I'm hoping for some ideas. Thank you.

4 answers:

Answer 0 (score: 1)

This should help you.

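// Requires: using System.Collections.Generic; using System.IO; using System.Text;
private IEnumerable<string> ReadCharsByChunks(int chunkSize, string filePath)
{
    // StreamReader decodes through a stateful decoder, so a multi-byte character
    // that straddles two reads is not corrupted (decoding raw byte chunks can split one).
    using (StreamReader reader = new StreamReader(filePath, Encoding.Default))
    {
        char[] buffer = new char[chunkSize];
        int currentRead;
        // Read returns the number of characters actually read, or 0 at end of file.
        while ((currentRead = reader.Read(buffer, 0, chunkSize)) > 0)
        {
            yield return new string(buffer, 0, currentRead);
        }
    }
}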

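private void SearchWord(string searchWord)
{
    if (string.IsNullOrEmpty(searchWord)) return; // nothing to search for

    StringBuilder builder = new StringBuilder();
    foreach (var chars in ReadCharsByChunks(2, "sample.txt")) // chunk size can be any positive number
    {
        builder.Append(chars);

        var existing = builder.ToString();
        int start = 0;
        int foundIndex;
        // A larger chunk may contain several matches, so keep searching after each one.
        while ((foundIndex = existing.IndexOf(searchWord, start, StringComparison.Ordinal)) >= 0)
        {
            // Found
            MessageBox.Show("Found");
            start = foundIndex + searchWord.Length;
        }

        // Keep only the tail that could still be the beginning of a match,
        // so the buffer never grows beyond one chunk plus the search word.
        int keepFrom = Math.Max(start, existing.Length - (searchWord.Length - 1));
        builder.Remove(0, keepFrom);
    }
}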

Answer 1 (score: 0)

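Text files can also be read character by character, as described in this question. To search for a particular string, you need some hand-written logic that looks for the target string in the incoming characters; this can be done with a state machine.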
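
As a minimal sketch of the character-by-character approach, the C# below reads one character at a time. Instead of an explicit state machine, it simply checks whether the text collected so far ends with the separator, which is a simpler technique that also handles separators with repeated prefixes. The class name DelimitedReader and the method ReadSegments are hypothetical names chosen for this illustration, and it assumes that any text left after the last separator counts as a final segment.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class DelimitedReader
{
    // Hypothetical helper: lazily yields each piece of text that precedes the separator.
    public static IEnumerable<string> ReadSegments(TextReader reader, string separator)
    {
        if (string.IsNullOrEmpty(separator))
            throw new ArgumentException("Separator must not be empty.", nameof(separator));

        var current = new StringBuilder();
        int next;
        // TextReader.Read() returns -1 at the end of the input.
        while ((next = reader.Read()) != -1)
        {
            current.Append((char)next);
            if (EndsWith(current, separator))
            {
                // Drop the separator itself and hand back the segment.
                current.Length -= separator.Length;
                yield return current.ToString();
                current.Clear();
            }
        }

        // Assumption: text after the last separator is a final segment.
        if (current.Length > 0)
            yield return current.ToString();
    }

    // True when the last characters of sb equal value (ordinal comparison).
    private static bool EndsWith(StringBuilder sb, string value)
    {
        if (sb.Length < value.Length) return false;
        int offset = sb.Length - value.Length;
        for (int i = 0; i < value.Length; i++)
        {
            if (sb[offset + i] != value[i]) return false;
        }
        return true;
    }
}
```

Only one segment is held in memory at a time, so memory use should depend on the largest segment rather than on the file size. Passing a StreamReader opened on the file, for example `DelimitedReader.ReadSegments(new StreamReader(path), "[STRING-SEPERATOR]")` with `path` standing in for your file, would stream the segments one by one.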

Answer 2 (score: 0)

StreamReader.Read has some overloads that may help you. Try this:

const int count = 200; // or whatever buffer size you think is better
char[] buffer = new char[count];
using (var sr = new System.IO.StreamReader("Path here"))
{
    int read;
    // The second argument is the offset into buffer (not a file position), so it stays 0.
    // Read returns the number of characters actually read, or 0 at end of stream.
    while ((read = sr.Read(buffer, 0, count)) > 0)
    {
        /*
        Check whether the first `read` characters of buffer contain your string separator, or at least the start of it.
        If the chunk ends with only part of it, you need to check the next chunk to make sure it's a real separator.
        Do your stuff, then carry any characters after the last separator over to the next iteration.
        */
    }
}
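
One way the loop above could be completed, offered as a sketch only: read fixed-size character chunks, append them to a pending buffer, and cut out every complete separator found so far, carrying the remainder into the next read so that a separator split across two chunks is still detected. ChunkedSplitter, Split and the default chunk size of 4096 are hypothetical choices for this illustration, not part of any .NET API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ChunkedSplitter
{
    // Hypothetical helper: splits the reader's content on a string separator, chunk by chunk.
    public static IEnumerable<string> Split(TextReader reader, string separator, int chunkSize = 4096)
    {
        if (string.IsNullOrEmpty(separator))
            throw new ArgumentException("Separator must not be empty.", nameof(separator));

        var buffer = new char[chunkSize];
        var pending = new StringBuilder(); // text read so far that has not been returned yet
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            // The old pending text holds no complete separator, so a new match can start
            // at most (separator.Length - 1) characters before the freshly appended data.
            int searchFrom = Math.Max(0, pending.Length - (separator.Length - 1));
            pending.Append(buffer, 0, read);

            string text = pending.ToString();
            int start = 0;
            int hit;
            while ((hit = text.IndexOf(separator, Math.Max(start, searchFrom), StringComparison.Ordinal)) >= 0)
            {
                yield return text.Substring(start, hit - start);
                start = hit + separator.Length;
            }

            // Keep only what follows the last separator for the next iteration.
            pending.Remove(0, start);
        }

        // Assumption: text after the last separator is a final segment.
        if (pending.Length > 0)
            yield return pending.ToString();
    }
}
```

The pending buffer is copied to a string on every read, which keeps the sketch short but may be slow when a single segment is much larger than the chunk size; a larger chunk size would reduce that cost.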

Answer 3 (score: 0)

Thanks for your replies. Here is the function I wrote in VB.NET:

' Reads from the stream's current position up to the next occurrence of UntilText
' and leaves the stream positioned just after it. At end of file it returns whatever is left.
' Assumes a single-byte encoding (one byte per character), because the position is rewound by character count.
Public Function ReadUntil(Stream As System.IO.FileStream, UntilText As String) As String
    Dim builder As New System.Text.StringBuilder()            ' search window
    Dim returnTextBuilder As New System.Text.StringBuilder()  ' everything read by this call
    ' Read in small chunks, but always at least one byte
    ' (the original size came out as -1 for a one-character separator).
    Dim buffer(Math.Max(1, UntilText.Length \ 2) - 1) As Byte
    Dim currentRead As Integer = -1

    Do Until currentRead = 0
        currentRead = Stream.Read(buffer, 0, buffer.Length)
        Dim chars As String = System.Text.Encoding.Default.GetString(buffer, 0, currentRead)

        builder.Append(chars)
        returnTextBuilder.Append(chars)

        Dim collected As String = builder.ToString()

        If collected.IndexOf(UntilText, StringComparison.Ordinal) >= 0 Then
            Dim returnText As String = returnTextBuilder.ToString()
            Dim indexOfSep As Integer = returnText.IndexOf(UntilText, StringComparison.Ordinal)
            Dim cutLength As Integer = returnText.Length - indexOfSep

            ' Rewind past anything that was read beyond the separator.
            If cutLength > UntilText.Length Then
                Stream.Position -= cutLength - UntilText.Length
            End If

            Return returnText.Remove(indexOfSep, cutLength)
        ElseIf collected.IndexOf(UntilText(0)) < 0 Then
            ' No separator can start inside the window, so drop it to keep the search cheap.
            builder.Length = 0
        End If
    Loop

    ' End of file: return the final segment instead of discarding it.
    Return returnTextBuilder.ToString()
End Function