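I have a CSV file with 2 million rows; it is 2 GB in size. Several free-text columns contain stray CRLFs, which prevent the file from being loaded into a SQL Server table. The error I get says that the last column does not end with ".".
I have the following code, but it throws an OutOfMemoryException when reading from fileName. The offending line is: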
var lines = File.ReadAllLines(fileName);
How can I fix this? Ideally, I would like to split the file into good rows and bad rows, or delete the rows that do not end with "CRLF".
int goodRow = 0;
int badRow = 0;
String badRowFileName = fileName.Substring(0, fileName.Length - 4) + "BadRow.csv";
String goodRowFileName = fileName.Substring(0, fileName.Length - 4) + "GoodRow.csv";
var charGood = "\"\"";
String lineOut = string.Empty;
String str = string.Empty;
var lines = File.ReadAllLines(fileName);
StringBuilder sbGood = new StringBuilder();
StringBuilder sbBad = new StringBuilder();
foreach (string line in lines)
{
    if (line.Contains(charGood))
    {
        goodRow++;
        sbGood.AppendLine(line);
    }
    else
    {
        badRow++;
        sbBad.AppendLine(line);
    }
}
if (badRow > 0)
{
    File.WriteAllText(badRowFileName, sbBad.ToString());
}
if (goodRow > 0)
{
    File.WriteAllText(goodRowFileName, sbGood.ToString());
}
sbGood.Clear();
sbBad.Clear();
msg = msg + "Good Rows - " + goodRow.ToString() + " Bad Rows - " + badRow.ToString() + " Done.";
Answer 0 (score: 2)
You can rewrite that code more efficiently like this:
int goodRow = 0, badRow = 0;
String badRowFileName = fileName.Substring(0, fileName.Length - 4) + "BadRow.csv";
String goodRowFileName = fileName.Substring(0, fileName.Length - 4) + "GoodRow.csv";
var charGood = "\"\"";
// File.ReadLines streams the file lazily, one line at a time. It returns
// IEnumerable<string>, which is not IDisposable, so it cannot go in a using.
var lines = File.ReadLines(fileName);
using (var swGood = new StreamWriter(goodRowFileName))
using (var swBad = new StreamWriter(badRowFileName))
{
    foreach (string line in lines)
    {
        if (line.Contains(charGood))
        {
            goodRow++;
            swGood.WriteLine(line);
        }
        else
        {
            badRow++;
            swBad.WriteLine(line);
        }
    }
}
msg += $"Good Rows: {goodRow,9} Bad Rows: {badRow,9} Done.";
But I would also consider using a real CSV parser. There are plenty on NuGet, and some may even let you clean up the data on the fly.
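As a rough illustration of the parser idea, here is a minimal sketch that uses TextFieldParser from the Microsoft.VisualBasic.FileIO namespace, which ships with .NET, rather than any particular NuGet package. It assumes the file is comma-delimited and that the free-text fields containing line breaks are enclosed in double quotes; if they are not quoted, no parser can reliably tell where a record ends. The CsvRowReader class and ReadRecords method are hypothetical names introduced only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

static class CsvRowReader
{
    // Hypothetical helper: streams the input and yields one string[] per
    // logical CSV record, so the whole file is never held in memory.
    public static IEnumerable<string[]> ReadRecords(TextReader input)
    {
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");              // assumption: comma-delimited
            parser.HasFieldsEnclosedInQuotes = true; // quoted fields may span lines

            while (!parser.EndOfData)
            {
                string[] fields;
                try
                {
                    fields = parser.ReadFields();
                }
                catch (MalformedLineException)
                {
                    // The parser could not make sense of this line; skip it.
                    // parser.ErrorLine holds the raw text if you want to log it.
                    continue;
                }
                yield return fields;
            }
        }
    }
}
```

To use it, pass a StreamReader opened on fileName and compare fields.Length against the expected column count to decide whether a record goes to the good file or the bad file. That check is likely to be more robust than searching each line for a pair of double quotes.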
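Answer 1 (score: 1)
I would not recommend reading the entire file into memory, processing it, and then writing all of the modified content out to a new file.
Instead, use file streams: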
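// Assumes fileName, goodRowFileName, badRowFileName, charGood, goodRow and
// badRow are declared as in the question.
using (var rdr = new StreamReader(fileName))
using (var wrtrGood = new StreamWriter(goodRowFileName))
using (var wrtrBad = new StreamWriter(badRowFileName))
{
    string line = null;
    // ReadLine returns null at end of file, so only one line is in memory at a time.
    while ((line = rdr.ReadLine()) != null)
    {
        if (line.Contains(charGood))
        {
            goodRow++;
            wrtrGood.WriteLine(line);
        }
        else
        {
            badRow++;
            wrtrBad.WriteLine(line);
        }
    }
}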