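Morning,
I'm trying to split a large text file (15,000,000 lines) using StreamReader / StreamWriter. Is there a faster way?
I tested it with 130,000 lines and it took 2 minutes 40 seconds, which implies that 15,000,000 lines would take roughly 5 hours. That seems a bit excessive.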
//Perform split.
public void SplitFiles(int[] newFiles, string filePath, int processorCount)
{
    using (StreamReader Reader = new StreamReader(filePath))
    {
        for (int i = 0; i < newFiles.Length; i++)
        {
            string extension = System.IO.Path.GetExtension(filePath);
            string temp = filePath.Substring(0, filePath.Length - extension.Length)
                + i.ToString();
            string FilePath = temp + extension;
            if (!File.Exists(FilePath))
            {
                for (int x = 0; x < newFiles[i]; x++)
                {
                    DataWriter(Reader.ReadLine(), FilePath);
                }
            }
            else
            {
                return;
            }
        }
    }
}
public void DataWriter(string rowData, string filePath)
{
    bool appendData = true;
    using (StreamWriter sr = new StreamWriter(filePath, appendData))
    {
        {
            sr.WriteLine(rowData);
        }
    }
}
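Thanks for your help.
Answer 0 (score: 1)
There are file-splitting tools that may well outperform your solution; try searching for "split file by line", for example.
If those aren't suitable, there are solutions that load the whole source file into memory and then write the files out, but given the size of the source file that is probably not appropriate.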
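For a file small enough to fit in memory, that approach might look roughly like the sketch below. It is only an illustration: the class name InMemorySplitter, the method name SplitInMemory and the parameter lineCounts (standing in for newFiles) are made up for this example, and with 15,000,000 lines the memory cost may well rule it out. Note that File.WriteAllLines overwrites an existing destination file.

```csharp
using System.IO;
using System.Linq;

public static class InMemorySplitter
{
    // Hypothetical helper: lineCounts[i] is the number of lines that go into output file i.
    public static void SplitInMemory(string sourceFilePath, int[] lineCounts)
    {
        string directory = Path.GetDirectoryName(sourceFilePath) ?? "";
        string baseName = Path.GetFileNameWithoutExtension(sourceFilePath);
        string extension = Path.GetExtension(sourceFilePath);

        // Loads the whole source file into memory in one go.
        string[] lines = File.ReadAllLines(sourceFilePath);

        int offset = 0;
        for (int i = 0; i < lineCounts.Length && offset < lines.Length; i++)
        {
            string destination = Path.Combine(directory, baseName + i + extension);
            // Take stops early if fewer lines remain than were asked for.
            string[] chunk = lines.Skip(offset).Take(lineCounts[i]).ToArray();
            // Each destination file is opened and written exactly once.
            File.WriteAllLines(destination, chunk);
            offset += chunk.Length;
        }
    }
}
```

The loop stops as soon as the source runs out of lines, so no empty trailing files are created.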
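As for improving the code, a minor improvement is in how the destination file path is generated (and in clearing up the confusion between the source file path you use and the destination files). There is no need to re-establish the source file extension on every pass through the loop.
The second improvement (and probably the more significant one, as the commenters have highlighted) is in how the destination files are written out. These appear to take differing numbers of lines from the source (the value in each newFiles entry), which you have specified for each individual destination file? So I would suggest that, for each entry, you read all the source lines relevant to the next destination file and then output the destination, rather than repeatedly opening the destination file. You could "gather" the lines in a StringBuilder / List etc., or just write them directly to the destination file (but opening it only once).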
public void SplitFiles(int[] newFiles, string sourceFilePath, int processorCount)
{
    string sourceDirectory = System.IO.Path.GetDirectoryName(sourceFilePath);
    string sourceFileName = System.IO.Path.GetFileNameWithoutExtension(sourceFilePath);
    string extension = System.IO.Path.GetExtension(sourceFilePath);
    using (StreamReader Reader = new StreamReader(sourceFilePath))
    {
        for (int i = 0; i < newFiles.Length; i++)
        {
            string destinationFileNameWithExtension = string.Format("{0}{1}{2}", sourceFileName, i, extension);
            string destinationFilePath = System.IO.Path.Combine(sourceDirectory, destinationFileNameWithExtension);
            if (!File.Exists(destinationFilePath))
            {
                // Read all the lines relevant to this destination file
                // and temporarily store them in memory
                StringBuilder destinationText = new StringBuilder();
                for (int x = 0; x < newFiles[i]; x++)
                {
                    string line = Reader.ReadLine();
                    if (line == null)
                    {
                        break; // Source file ran out of lines early
                    }
                    // AppendLine restores the line break that ReadLine strips
                    destinationText.AppendLine(line);
                }
                DataWriter(destinationFilePath, destinationText.ToString());
            }
            else
            {
                return;
            }
        }
    }
}
private static void DataWriter(string destinationFilePath, string content)
{
    // The destination file is opened exactly once per chunk
    using (StreamWriter sr = new StreamWriter(destinationFilePath))
    {
        sr.Write(content);
    }
}
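Answer 1 (score: 1)
You haven't made it clear, but I'm assuming that the value of each element of the newFiles array is the number of lines to copy from the original into that file. Note that at the moment you don't detect the case where there is extra data at the end of the input file, or where it is shorter than expected. I suspect you want something like this: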
public void SplitFiles(int[] newFiles, string inputFile)
{
    string baseName = Path.GetFileNameWithoutExtension(inputFile);
    string extension = Path.GetExtension(inputFile);
    using (TextReader reader = File.OpenText(inputFile))
    {
        for (int i = 0; i < newFiles.Length; i++)
        {
            string outputFile = baseName + i + extension;
            if (File.Exists(outputFile))
            {
                // Better than silently returning, I'd suggest...
                throw new IOException("File already exists: " + outputFile);
            }
            int linesToCopy = newFiles[i];
            using (TextWriter writer = File.CreateText(outputFile))
            {
                // The loop condition must test j, not i, or it never ends correctly
                for (int j = 0; j < linesToCopy; j++)
                {
                    string line = reader.ReadLine();
                    if (line == null)
                    {
                        return; // Premature end of input
                    }
                    writer.WriteLine(line);
                }
            }
        }
    }
}
Note that this still doesn't detect whether there is any unused input... it's not clear what you want to do in that case.
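If leftover input should count as an error, one possible check is to attempt one more read after the last chunk has been copied. This is a minimal sketch that assumes that is the behaviour you want; InputChecks and HasUnusedInput are names invented for this example.

```csharp
using System.IO;

public static class InputChecks
{
    // Hypothetical helper: returns true if the reader still has at least one line left.
    // It consumes that line, so call it only after the last chunk has been copied.
    public static bool HasUnusedInput(TextReader reader)
    {
        return reader.ReadLine() != null;
    }
}
```

Called just after the for loop in SplitFiles, while the reader is still open, it could be used as: if (InputChecks.HasUnusedInput(reader)) throw new IOException("Input has more lines than expected");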
One option for code clarity is to extract the middle of this into a separate method:
public void SplitFiles(int[] newFiles, string inputFile)
{
    string baseName = Path.GetFileNameWithoutExtension(inputFile);
    string extension = Path.GetExtension(inputFile);
    using (TextReader reader = File.OpenText(inputFile))
    {
        for (int i = 0; i < newFiles.Length; i++)
        {
            string outputFile = baseName + i + extension;
            // Could put this into the CopyLines method if you wanted
            if (File.Exists(outputFile))
            {
                // Better than silently returning, I'd suggest...
                throw new IOException("File already exists: " + outputFile);
            }
            CopyLines(reader, outputFile, newFiles[i]);
        }
    }
}
private static void CopyLines(TextReader reader, string outputFile, int count)
{
    using (TextWriter writer = File.CreateText(outputFile))
    {
        for (int i = 0; i < count; i++)
        {
            string line = reader.ReadLine();
            if (line == null)
            {
                return; // Premature end of input
            }
            writer.WriteLine(line);
        }
    }
}
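Answer 2 (score: 0)
I recently had to do this for several hundred files, each under 2 GB (up to 1.92 GB), and the fastest method I found (if you have the memory available) was a StringBuilder. All the other methods I tried were painfully slow.
Note that this is memory dependent. Adjust the "CurrentPosition == 130000" check accordingly.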
string CurrentLine = String.Empty;
int CurrentPosition = 0;
int CurrentSplit = 0;
foreach (string file in Directory.GetFiles(@"C:\FilesToSplit"))
{
    StringBuilder sb = new StringBuilder();
    using (StreamReader sr = new StreamReader(file))
    {
        while ((CurrentLine = sr.ReadLine()) != null)
        {
            if (CurrentPosition == 130000) // Or whatever you want to split by.
            {
                // Path.GetExtension already includes the leading dot, so no extra "." is needed.
                using (StreamWriter sw = new StreamWriter(@"C:\FilesToSplit\SplitFiles\" + Path.GetFileNameWithoutExtension(file) + "-" + CurrentSplit + Path.GetExtension(file)))
                {
                    // Append this line too, so we don't lose it.
                    // AppendLine restores the line break that ReadLine strips.
                    sb.AppendLine(CurrentLine);
                    // Write the StringBuilder contents
                    sw.Write(sb.ToString());
                    // Clear the StringBuilder buffer, so it doesn't get too big. You can adjust this based on your computer's available memory.
                    sb.Clear();
                    // Increment the CurrentSplit number.
                    CurrentSplit++;
                    // Reset the current line position. We've found 130,001 lines of text.
                    CurrentPosition = 0;
                }
            }
            else
            {
                sb.AppendLine(CurrentLine);
                CurrentPosition++;
            }
        }
    }
    // Write out whatever is left after the last full chunk, otherwise the tail of the file is lost.
    if (sb.Length > 0)
    {
        using (StreamWriter sw = new StreamWriter(@"C:\FilesToSplit\SplitFiles\" + Path.GetFileNameWithoutExtension(file) + "-" + CurrentSplit + Path.GetExtension(file)))
        {
            sw.Write(sb.ToString());
        }
    }
    // Reset the integers at the end of each file check, otherwise it can quickly go out of order.
    CurrentPosition = 0;
    CurrentSplit = 0;
}