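Parsing a huge text file (about 2 GB) with custom delimiters

Time: 2017-12-14 01:41:43

Tags: c#

I have a text file of about 2 GB that I am trying to parse in C#. The file uses custom delimiters for rows and columns. I want to parse the file, extract the data, and write it to another file by inserting a column header, replacing the RowDelimiter with a newline, and replacing the ColumnDelimiter with a tab, so that I get the data in tabular form.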

Sample data:
1'~'2'~'3 ##### 11'~'12'~'13

RowDelimiter:#####
ColumnDelimiter:'~'

I keep getting a System.OutOfMemoryException on the following line:

while ((line = rdr.ReadLine()) != null)

public void ParseFile(string inputfile,string outputfile,string header)
{

    using (StreamReader rdr = new StreamReader(inputfile))
    {
        string line;

        while ((line = rdr.ReadLine()) != null)
        {
            using (StreamWriter sw = new StreamWriter(outputfile))
            {
                //Write the Header row
                sw.Write(header);

                //parse the file
                string[] rows = line.Split(new string[] { ParserConstants.RowSeparator },
                    StringSplitOptions.None);

                foreach (string row in rows)
                {
                    string[] columns = row.Split(new string[] {ParserConstants.ColumnSeparator},
                        StringSplitOptions.None);
                    foreach (string column in columns)
                    {
                        sw.Write(column + "\t");
                    }
                    sw.Write(ParserConstants.NewlineCharacter);
                    Console.WriteLine();
                }
            }

            Console.WriteLine("File Parsing completed");

        }
    }
}

5 answers:

Answer 0 (score: 1)

Read the data into a buffer, then parse it.

// Requires: using System; using System.IO;
using (StreamReader rdr = new StreamReader(inputfile))
using (StreamWriter sw = new StreamWriter(outputfile))
{
    char[] buffer = new char[256];
    int read;

    //Write the Header row
    sw.Write(header);

    // Text after the last complete row delimiter seen so far
    string remainder = string.Empty;
    while ((read = rdr.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Prepend the remainder before splitting, so that a row delimiter
        // straddling two reads is still recognised
        string bufferData = remainder + new string(buffer, 0, read);
        //parse the file
        string[] rows = bufferData.Split(
            new string[] { ParserConstants.RowSeparator },
            StringSplitOptions.None);

        // The last element may be an incomplete row; keep it for the next read
        int completeRows = rows.Length - 1;
        remainder = rows[completeRows];
        for (int i = 0; i < completeRows; i++)
        {
            string[] columns = rows[i].Split(
                new string[] { ParserConstants.ColumnSeparator },
                StringSplitOptions.None);
            foreach (string column in columns)
            {
                sw.Write(column + "\t");
            }
            sw.Write(ParserConstants.NewlineCharacter);
        }
    }

    // Whatever is left after the final read is the last row
    if (remainder.Length > 0)
    {
        string[] columns = remainder.Split(
            new string[] { ParserConstants.ColumnSeparator },
            StringSplitOptions.None);
        foreach (string column in columns)
        {
            sw.Write(column + "\t");
        }
        sw.Write(ParserConstants.NewlineCharacter);
    }

    Console.WriteLine("File Parsing completed");
}
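
The carry-over idea can be checked in isolation with a very small buffer. The following is only a sketch: `SplitRows` and `ChunkedSplitSketch` are hypothetical names, and a `StringReader` stands in for the real file so that the row delimiter is forced to straddle two reads.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkedSplitSketch
{
    // Hypothetical helper: yields rows from any TextReader, reading bufferSize chars at a time.
    public static IEnumerable<string> SplitRows(TextReader reader, string rowSeparator, int bufferSize)
    {
        var buffer = new char[bufferSize];
        string pending = string.Empty; // incomplete row carried over between reads
        int read;

        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            // Join the carried-over text with the new chunk before splitting,
            // so a separator cut in half by the buffer edge is still found.
            pending += new string(buffer, 0, read);
            string[] parts = pending.Split(new[] { rowSeparator }, StringSplitOptions.None);

            for (int i = 0; i < parts.Length - 1; i++)
                yield return parts[i];

            pending = parts[parts.Length - 1];
        }

        if (pending.Length > 0)
            yield return pending;
    }

    static void Main()
    {
        // With a 4-char buffer, "#####" is split across two reads.
        var input = new StringReader("1'~'2'~'3#####11'~'12'~'13");
        foreach (string row in SplitRows(input, "#####", 4))
            Console.WriteLine(row);
    }
}
```

Note that `pending` grows until a delimiter is found, so a file with extremely long rows would still need a different approach.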

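Answer 1 (score: 1)

The problem you are running into is that you are eagerly consuming the whole file and putting it into memory. As you now know, trying to split a 2 GB file in memory is problematic.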
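
A small illustration of why this happens, assuming (as the sample data suggests) that the file contains no standard line breaks: `ReadLine` only stops at `\r` or `\n`, so it returns the entire input as a single string. The helper `CountLines` is hypothetical, and a `StringReader` stands in for the file.

```csharp
using System;
using System.IO;

static class ReadLineSketch
{
    // Hypothetical helper: counts how many lines ReadLine yields for a reader.
    public static int CountLines(TextReader reader)
    {
        int count = 0;
        while (reader.ReadLine() != null)
            count++;
        return count;
    }

    static void Main()
    {
        // No '\r' or '\n' anywhere, so the custom row delimiter is invisible to ReadLine.
        string data = "1'~'2'~'3#####11'~'12'~'13";
        Console.WriteLine(CountLines(new StringReader(data))); // prints 1
    }
}
```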

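The solution? Consume it one row at a time. Because your file does not use a standard line separator, you will have to implement a custom parser to do this for you. The following code does just that (or I think it does, I have not tested it). It can probably be improved a lot performance-wise, but it should at least get you started in the right direction (C# 7 syntax):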

// Requires: using System.Collections.Generic; using System.IO; using System.Text;
public static IEnumerable<string> GetRows(string path, string rowSeparator)
{
    // True when the characters at the end of the builder equal the row separator.
    bool endsWithSeparator(StringBuilder sb)
    {
        if (sb.Length < rowSeparator.Length)
            return false;

        int offset = sb.Length - rowSeparator.Length;
        for (int i = 0; i < rowSeparator.Length; i++)
        {
            if (sb[offset + i] != rowSeparator[i])
                return false;
        }

        return true;
    }

    using (var reader = new StreamReader(path))
    {
        int read;
        var rowBuffer = new StringBuilder();

        // Read one character at a time; StreamReader buffers internally.
        while ((read = reader.Read()) > -1)
        {
            rowBuffer.Append((char)read);

            if (endsWithSeparator(rowBuffer))
            {
                // Drop the separator and hand back the completed row.
                rowBuffer.Length -= rowSeparator.Length;
                yield return rowBuffer.ToString();
                rowBuffer.Clear();
            }
        }

        // The last row has no trailing separator.
        if (rowBuffer.Length > 0)
            yield return rowBuffer.ToString();
    }
}

Now you have a lazy enumeration of the rows in your file, which you can process as intended:

foreach (var row in GetRows(inputFile, ParserConstants.RowSeparator))
{
     var columns = row.Split(new string[] {ParserConstants.ColumnSeparator},
                             StringSplitOptions.None);
     //etc.
}
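
One way the elided part could look, as a sketch only: the hypothetical helper `WriteTable` takes any sequence of rows, writes the header once, and joins the columns of each row with tabs. The header text and the use of a `StringWriter` in `Main` are assumptions for illustration; in practice the writer would be a `StreamWriter` on the output file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class TableWriterSketch
{
    // Hypothetical helper: writes a header line, then one tab-separated line per row.
    public static void WriteTable(IEnumerable<string> rows, TextWriter writer,
                                  string header, string columnSeparator)
    {
        writer.WriteLine(header);
        foreach (string row in rows)
        {
            string[] columns = row.Split(new[] { columnSeparator }, StringSplitOptions.None);
            writer.WriteLine(string.Join("\t", columns));
        }
    }

    static void Main()
    {
        // Rows as a lazy row reader would yield them for the sample data.
        var rows = new[] { "1'~'2'~'3", "11'~'12'~'13" };
        var output = new StringWriter();
        WriteTable(rows, output, "A\tB\tC", "'~'");
        Console.Write(output.ToString());
    }
}
```

Because the rows are passed as an `IEnumerable<string>`, only one row needs to be held in memory at a time when the source is a lazy enumeration.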

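Answer 2 (score: 1)

As already mentioned in the comments, you cannot use ReadLine for this; you have to process the data one byte (or character) at a time. The good news is that this is basically how ReadLine works anyway, so we do not lose much in this case.

With a StreamReader we can read a series of characters (in whatever encoding you need) from the source stream into an array. Using that and a StringBuilder, we can process the stream in chunks and check for the delimiter sequence along the way.

Here is a method that handles an arbitrary delimiter: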

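// Requires: using System.Collections.Generic; using System.IO; using System.Text;
public static IEnumerable<string> ReadDelimitedRows(StreamReader reader, string delimiter)
{
    char[] delimChars = delimiter.ToCharArray();
    int matchCount = 0; // number of delimiter chars matched so far
    char[] buffer = new char[512];
    int rc;
    StringBuilder sb = new StringBuilder();

    while ((rc = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < rc; i++)
        {
            char c = buffer[i];
            if (matchCount > 0 && c != delimChars[matchCount])
            {
                // partial match failed: the matched prefix was ordinary data
                sb.Append(delimChars, 0, matchCount);
                matchCount = 0;
                // c is re-tested below as a possible start of a new delimiter.
                // Delimiters whose prefix repeats inside them (e.g. "aab")
                // would need a KMP-style fallback instead of a restart.
            }

            if (c == delimChars[matchCount])
            {
                if (++matchCount >= delimChars.Length)
                {
                    // found full row delimiter
                    yield return sb.ToString();
                    sb.Clear();
                    matchCount = 0;
                }
            }
            else
            {
                sb.Append(c);
            }
        }
    }

    // keep a partial delimiter match left over at the end of the stream
    if (matchCount > 0)
        sb.Append(delimChars, 0, matchCount);

    // return the last row if found
    if (sb.Length > 0)
        yield return sb.ToString();
}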

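This should handle cases where part of the row delimiter appears in the actual data.

To convert a file in the input format you describe into a simple tab-delimited format, you could do something along these lines: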

const string RowDelimiter = "#####";
const string ColumnDelimiter = "'~'";

using (var reader = new StreamReader(inputFilename))
using (var writer = new StreamWriter(File.Create(outputFilename)))
{
    foreach (var row in ReadDelimitedRows(reader, RowDelimiter))
    {
        // WriteLine, so that each row ends with a newline in the output
        writer.WriteLine(row.Replace(ColumnDelimiter, "\t"));
    }
}

This should process fairly quickly without using much memory. Non-ASCII output may require some adjustments.
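
For non-ASCII data, one possible adjustment is to pass explicit encodings to the reader and writer instead of relying on the defaults. The sketch below assumes UTF-8 input and UTF-16 output purely for illustration; the actual encodings depend on the file. `CopyWithEncoding` is a hypothetical helper, and the temporary files exist only to make the example runnable.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingSketch
{
    // Hypothetical helper: copies text between files using explicitly chosen encodings.
    public static void CopyWithEncoding(string inputPath, string outputPath,
                                        Encoding inputEncoding, Encoding outputEncoding)
    {
        using (var reader = new StreamReader(inputPath, inputEncoding))
        using (var writer = new StreamWriter(outputPath, false, outputEncoding))
        {
            var buffer = new char[4096];
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
                writer.Write(buffer, 0, read);
        }
    }

    static void Main()
    {
        string source = Path.GetTempFileName();
        string target = Path.GetTempFileName();
        try
        {
            File.WriteAllText(source, "é'~'ü#####ñ", Encoding.UTF8);
            CopyWithEncoding(source, target, Encoding.UTF8, Encoding.Unicode);
            Console.WriteLine(File.ReadAllText(target, Encoding.Unicode));
        }
        finally
        {
            File.Delete(source);
            File.Delete(target);
        }
    }
}
```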

Answer 3 (score: 0)

I think this should do the trick...

// Requires: using System; using System.IO; using System.Text; using System.Text.RegularExpressions;
public void ParseFile(string inputfile, string outputfile, string header)
{
    int blockSize = 1024;

    using (var file = File.OpenRead(inputfile))
    {
        using (StreamWriter sw = new StreamWriter(outputfile))
        {
            int bytes = 0;
            long parsedBytes = 0; // long: a 2 GB file can overflow an int counter
            var buffer = new byte[blockSize];
            string lastRow = string.Empty;

            // A stateful decoder copes with multi-byte characters that are
            // cut in half by the buffer edge.
            Decoder decoder = Encoding.Default.GetDecoder();
            var chars = new char[Encoding.Default.GetMaxCharCount(blockSize)];

            // Escape the separator so it is matched literally, and build the regex once.
            var columnRegex = new Regex(Regex.Escape(ParserConstants.ColumnSeparator));

            // Write the header row
            sw.Write(header);

            while ((bytes = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Because the buffer edge could split a RowDelimiter, we need to keep the
                // last row from the prior split operation.  Append the new buffer to the
                // last row from the prior loop iteration.
                int charCount = decoder.GetChars(buffer, 0, bytes, chars, 0);
                lastRow += new string(chars, 0, charCount);

                //parse the file
                string[] rows = lastRow.Split(new string[] { ParserConstants.RowSeparator }, StringSplitOptions.None);

                // We cannot process the last row in this set because it may not be a complete
                // row, and tokens could be clipped.
                for (int i = 0; i < rows.Length - 1; i++)
                {
                    sw.Write(columnRegex.Replace(rows[i], "\t") + ParserConstants.NewlineCharacter);
                }
                lastRow = rows[rows.Length - 1];
                parsedBytes += bytes;
                // The following statement is not quite true because we haven't parsed the lastRow.
                Console.WriteLine($"Parsed {parsedBytes:N0} bytes");
            }
            // Now that there are no more bytes to read, we know that the lastrow is complete.
            sw.Write(columnRegex.Replace(lastRow, "\t"));
        }
    }
    Console.WriteLine("File Parsing completed.");
}
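
A note on replacing the column separator with a regular expression: the separator from the question, `'~'`, contains no regex metacharacters, but a separator such as `|` or `.` would be interpreted as pattern syntax unless it is escaped. The sketch below compares two literal replacements; the helper names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class SeparatorReplaceSketch
{
    // Plain string replacement: the separator is always treated literally.
    public static string ToTabs(string row, string columnSeparator)
    {
        return row.Replace(columnSeparator, "\t");
    }

    // Regex replacement: Regex.Escape makes metacharacters in the separator literal.
    public static string ToTabsWithRegex(string row, string columnSeparator)
    {
        return Regex.Replace(row, Regex.Escape(columnSeparator), "\t");
    }

    static void Main()
    {
        Console.WriteLine(ToTabs("1'~'2'~'3", "'~'") == "1\t2\t3");   // prints True
        Console.WriteLine(ToTabsWithRegex("a|b|c", "|") == "a\tb\tc"); // prints True
    }
}
```

For a fixed literal separator, `string.Replace` is the simpler choice; a regex only pays off when the separator is genuinely a pattern.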

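Answer 4 (score: 0)

Late to the party here, but if anyone else is wondering about a simple way to load such a large CSV file with custom delimiters, Cinchoo ETL can do the job for you.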

using (var parser = new ChoCSVReader("CustomNewLine.csv")
    .WithDelimiter("~")
    .WithEOLDelimiter("#####")
    )
{
    foreach (dynamic x in parser)
        Console.WriteLine(x.DumpAsJson());
}

Disclaimer: I am the author of this library.