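Fastest way to split a text file

Time: 2012-03-03 10:05:01

Tags: c# c#-4.0 file-io

Morning,

I'm trying to split a large text file (15,000,000 lines) using StreamReader / StreamWriter. Is there a quicker way?

I tested it with 130,000 lines and it took 2 min 40 s, which implies that 15,000,000 lines would take roughly 5 hours. That seems a bit excessive.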

//Perform split.
public void SplitFiles(int[] newFiles, string filePath, int processorCount)
{
    using (StreamReader Reader = new StreamReader(filePath))
    {
        for (int i = 0; i < newFiles.Length; i++)
        {
            string extension = System.IO.Path.GetExtension(filePath);
            string temp = filePath.Substring(0, filePath.Length - extension.Length)
                              + i.ToString();
            string FilePath = temp + extension;

            if (!File.Exists(FilePath))
            {
                for (int x = 0; x < newFiles[i]; x++)
                {
                    DataWriter(Reader.ReadLine(), FilePath);
                }
            }
            else
            {
                return;
            }
        }
    }
}

public void DataWriter(string rowData, string filePath)
{
    bool appendData = true;
    using (StreamWriter sr = new StreamWriter(filePath, appendData))
    {
        {
            sr.WriteLine(rowData);
        }
    }
}

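Thanks for your help.

3 Answers:

Answer 0: (score: 1)

There are utilities for splitting files that may outperform your solution; for example, search for "split file by line".

If they don't suit, there are solutions that load the whole source file into memory and then write out the files, but that is probably not appropriate given the size of the source file.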
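For completeness, the load-everything approach could be sketched roughly as below. This is only an illustration, not a recommendation for a 15,000,000-line file: `InMemorySplitter`, `Split` and `lineCounts` are names I made up, and it assumes that each entry of `lineCounts` is the number of lines wanted in the corresponding output file and that overwriting existing output files is acceptable.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class InMemorySplitter
{
    // lineCounts[i] is assumed to be the number of lines for output file i.
    public static void Split(string sourcePath, int[] lineCounts)
    {
        // Reads the entire file at once, so it must fit comfortably in memory.
        string[] lines = File.ReadAllLines(sourcePath);

        string directory = Path.GetDirectoryName(sourcePath) ?? "";
        string baseName = Path.GetFileNameWithoutExtension(sourcePath);
        string extension = Path.GetExtension(sourcePath); // includes the leading dot

        int offset = 0;
        for (int i = 0; i < lineCounts.Length && offset < lines.Length; i++)
        {
            string destination = Path.Combine(directory, baseName + i + extension);

            // Each destination is opened and written exactly once.
            // Take simply yields fewer lines if the source runs short.
            // Note: WriteAllLines overwrites an existing file.
            File.WriteAllLines(destination, lines.Skip(offset).Take(lineCounts[i]));
            offset += lineCounts[i];
        }
    }
}
```

Calling `InMemorySplitter.Split(@"data.txt", new[] { 2, 3 })` on a five-line file would produce `data0.txt` with the first two lines and `data1.txt` with the remaining three.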

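In terms of improving your code, a minor improvement would be the generation of the destination file path (and clearing up the confusion between the source filePath you use and the destination files). You don't need to re-establish the source file extension on every pass through the loop.

The second improvement (and probably the more significant one, as highlighted by commenters) concerns how you write out the destination files. These seem to contain a different number of lines from the source: the value in each newFiles entry looks like the number of lines you want in that individual destination file? So I'd suggest that, for each entry, you read all the source lines relevant to the next destination file and then output the destination, rather than repeatedly opening the destination file. You could "gather" the lines in a StringBuilder / List etc., or alternatively just write them directly to the destination file (but only opening it once).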

    public void SplitFiles(int[] newFiles, string sourceFilePath, int processorCount)
    {
        string sourceDirectory = System.IO.Path.GetDirectoryName(sourceFilePath);
        string sourceFileName = System.IO.Path.GetFileNameWithoutExtension(sourceFilePath);
        string extension = System.IO.Path.GetExtension(sourceFilePath);

        using (StreamReader Reader = new StreamReader(sourceFilePath))
        {
            for (int i = 0; i < newFiles.Length; i++)
            {
                string destinationFileNameWithExtension = string.Format("{0}{1}{2}", sourceFileName, i, extension);

                string destinationFilePath = System.IO.Path.Combine(sourceDirectory, destinationFileNameWithExtension);

                if (!File.Exists(destinationFilePath))
                {
                    // Read all the lines relevant to this destination file
                    // and temporarily store them in memory
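                    StringBuilder destinationText = new StringBuilder();
                    for (int x = 0; x < newFiles[i]; x++)
                    {
                        string line = Reader.ReadLine();
                        if (line == null) break; // Source ended early
                        // AppendLine keeps the line break that ReadLine strips
                        destinationText.AppendLine(line);
                    }
                    DataWriter(destinationFilePath, destinationText.ToString());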
                }
                else
                {
                    return;
                }
            }
        }
    }

private static void DataWriter(string destinationFilePath, string content)
{
    using (StreamWriter sr = new StreamWriter(destinationFilePath))
    {
        {
            sr.Write(content);
        }
    }
}

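Answer 1: (score: 1)

You haven't made it very clear, but I'm assuming that the value of each element of the newFiles array is the number of lines to copy from the original into that file. Note that currently you don't detect the situation where there is extra data at the end of the input file, or where it is shorter than expected. I suspect you want something like this: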

public void SplitFiles(int[] newFiles, string inputFile)
{
    string baseName = Path.GetFileNameWithoutExtension(inputFile);
    string extension = Path.GetExtension(inputFile);
    using (TextReader reader = File.OpenText(inputFile))
    {
        for (int i = 0; i < newFiles.Length; i++)
        {
            string outputFile = baseName + i + extension;
            if (File.Exists(outputFile))
            {
                // Better than silently returning, I'd suggest...
                throw new IOException("File already exists: " + outputFile);
            }

            int linesToCopy = newFiles[i];
            using (TextWriter writer = File.CreateText(outputFile))
            {
                for (int j = 0; j < linesToCopy; j++)
                {
                    string line = reader.ReadLine();
                    if (line == null)
                    {
                        return; // Premature end of input
                    }
                    writer.WriteLine(line);
                }
            }
        }
    }
}

Note that this still won't detect whether there is any unused input... it's not clear what you want to do in that case.
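If leftover input should count as an error, one possible policy (purely an assumption, since the question doesn't say) is to try reading one more line after the last file has been written and to throw if that succeeds. A minimal sketch, where `StrictSplitter`, `Split` and `lineCounts` are illustrative names and the outputs are written next to the input file:

```csharp
using System;
using System.IO;

// Hypothetical variant that treats both short and long input as errors.
public static class StrictSplitter
{
    public static void Split(int[] lineCounts, string inputFile)
    {
        string directory = Path.GetDirectoryName(inputFile) ?? "";
        string baseName = Path.GetFileNameWithoutExtension(inputFile);
        string extension = Path.GetExtension(inputFile);

        using (TextReader reader = File.OpenText(inputFile))
        {
            for (int i = 0; i < lineCounts.Length; i++)
            {
                string outputFile = Path.Combine(directory, baseName + i + extension);
                using (TextWriter writer = File.CreateText(outputFile))
                {
                    for (int j = 0; j < lineCounts[i]; j++)
                    {
                        string line = reader.ReadLine();
                        if (line == null)
                        {
                            throw new InvalidDataException(
                                "Input ended early while writing " + outputFile);
                        }
                        writer.WriteLine(line);
                    }
                }
            }

            // Any further line means the counts did not cover the whole input.
            if (reader.ReadLine() != null)
            {
                throw new InvalidDataException(
                    "Input contains more lines than the requested counts.");
            }
        }
    }
}
```

Whether to throw, log, or write the remainder to an extra file is a design choice; throwing just makes the mismatch impossible to miss. The output files written before the exception are left on disk in this sketch.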

One option for code clarity is to extract the middle of this into a separate method:

public void SplitFiles(int[] newFiles, string inputFile)
{
    string baseName = Path.GetFileNameWithoutExtension(inputFile);
    string extension = Path.GetExtension(inputFile);
    using (TextReader reader = File.OpenText(inputFile))
    {
        for (int i = 0; i < newFiles.Length; i++)
        {
            string outputFile = baseName + i + extension;
            // Could put this into the CopyLines method if you wanted
            if (File.Exists(outputFile))
            {
                // Better than silently returning, I'd suggest...
                throw new IOException("File already exists: " + outputFile);
            }

            CopyLines(reader, outputFile, newFiles[i]);
        }
    }
}

private static void CopyLines(TextReader reader, string outputFile, int count)
{
    using (TextWriter writer = File.CreateText(outputFile))
    {
        for (int i = 0; i < count; i++)
        {
            string line = reader.ReadLine();
            if (line == null)
            {
                return; // Premature end of input
            }
            writer.WriteLine(line);
        }
    }
}

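Answer 2: (score: 0)

I recently had to do this for several hundred files, each under 2 GB (the largest was 1.92 GB), and the fastest method I found (if you have the memory available) is a StringBuilder. All the other methods I tried were painfully slow.

Please note that this is dependent on memory. Adjust the "CurrentPosition == 130000" threshold accordingly.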

        string CurrentLine = String.Empty;
        int CurrentPosition = 0;
        int CurrentSplit = 0;

        foreach (string file in Directory.GetFiles(@"C:\FilesToSplit"))
        {
            StringBuilder sb = new StringBuilder();
            using (StreamReader sr = new StreamReader(file))
            {
                while ((CurrentLine = sr.ReadLine()) != null)
                {
                    if (CurrentPosition == 130000) // Or whatever you want to split by.
                    {
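                        // Path.GetExtension already includes the leading dot.
                        using (StreamWriter sw = new StreamWriter(@"C:\FilesToSplit\SplitFiles\" + Path.GetFileNameWithoutExtension(file) + "-" + CurrentSplit + Path.GetExtension(file)))
                        {
                            // Append this line too, so we don't lose it.
                            // AppendLine restores the line break that ReadLine strips.
                            sb.AppendLine(CurrentLine);
                            // Write the StringBuilder contents
                            sw.Write(sb.ToString());
                            // Clear the StringBuilder buffer, so it doesn't get too big. You can adjust this based on your computer's available memory.
                            sb.Clear();
                            // Increment the CurrentSplit number.
                            CurrentSplit++;
                            // Reset the current line position. We've found 130,001 lines of text.
                            CurrentPosition = 0;
                        }
                    }
                    else
                    {
                        sb.AppendLine(CurrentLine);
                        CurrentPosition++;
                    }
                }
            }
            // Write whatever is still buffered; the last chunk is usually shorter than the threshold.
            if (sb.Length > 0)
            {
                using (StreamWriter sw = new StreamWriter(@"C:\FilesToSplit\SplitFiles\" + Path.GetFileNameWithoutExtension(file) + "-" + CurrentSplit + Path.GetExtension(file)))
                {
                    sw.Write(sb.ToString());
                }
            }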
            // Reset the integers at the end of each file check, otherwise it can quickly go out of order.
            CurrentPosition = 0;
            CurrentSplit = 0;
        }