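I'm trying to import a CSV file using TextFieldParser. One particular CSV file is causing problems because of its non-standard formatting. Its fields are enclosed in double quotes, and the problem occurs when a field also contains a set of unescaped double quotes.
Here is an oversimplified test case that highlights the problem. The actual CSV files I'm dealing with aren't all formatted the same way, and they have many fields, any of which may contain these tricky formatting issues.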
TextReader reader = new StringReader("\"Row\",\"Test String\"\n" +
    "\"1\",\"This is a test string. It is parsed correctly.\"\n" +
    "\"2\",\"This is a test string with a comma, which is parsed correctly\"\n" +
    "\"3\",\"This is a test string with double \"\"double quotes\"\". It is parsed correctly\"\n" +
    "\"4\",\"This is a test string with 'single quotes'. It is parsed correctly\"\n" +
    "5,This is a test string with fields that aren't enclosed in double quotes. It is parsed correctly.\n" +
    "\"6\",\"This is a test string with single \"double quotes\". It can't be parsed.\"");
using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        Console.WriteLine("This line was parsed as:\n{0},{1}",
            fields[0], fields[1]);
    }
}
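Is there any way to use TextFieldParser to correctly parse CSVs with this kind of formatting?
Answer 0 (score: 7)
I agree with Hans Passant's advice that parsing malformed data isn't your responsibility. However, in keeping with the Robustness Principle, someone facing this situation may want to try to handle specific kinds of malformed data. The code I wrote below works on the data set specified in the question. Basically, it catches the parser error on the malformed line, determines from the first character whether the line is double-quote wrapped, and then manually splits the line and strips all of the wrapping double quotes.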
using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };
    while (!parser.EndOfData)
    {
        string[] fields = null;
        try
        {
            fields = parser.ReadFields();
        }
        catch (MalformedLineException)
        {
            // Only attempt a manual split when the line looks double-quote wrapped.
            if (parser.ErrorLine.StartsWith("\""))
            {
                // Drop the first and last quote, then split on the "," sequence between fields.
                var line = parser.ErrorLine.Substring(1, parser.ErrorLine.Length - 2);
                fields = line.Split(new string[] { "\",\"" }, StringSplitOptions.None);
            }
            else
            {
                throw;
            }
        }
        Console.WriteLine("This line was parsed as:\n{0},{1}", fields[0], fields[1]);
    }
}
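I'm sure it's possible to concoct a pathological example where this fails (for instance, commas adjacent to double quotes inside a field value), but any such example would probably be unparseable in the strictest sense, whereas the problem line in the question is still decipherable despite being malformed.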
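The fallback in the catch block can also be pulled out into a small pure function, which makes it easy to exercise without a parser. This is only a minimal sketch and not part of the original answer: the names `QuoteWrappedLine` and `Split` are hypothetical, and it assumes that every field on the line is wrapped in double quotes and that the `","` sequence never occurs inside a field value. Trimming a trailing line terminator is an added precaution, not something the answer does.

```csharp
using System;

public static class QuoteWrappedLine
{
    // Hypothetical helper: splits a line whose fields are ALL wrapped in
    // double quotes, e.g. "6","text with "inner" quotes".
    public static string[] Split(string line)
    {
        if (line == null) throw new ArgumentNullException(nameof(line));

        // Drop any trailing line terminator before inspecting the line.
        string trimmed = line.TrimEnd('\r', '\n');

        // The line must have at least an opening and a closing quote.
        if (trimmed.Length < 2 || trimmed[0] != '"' || trimmed[trimmed.Length - 1] != '"')
            throw new FormatException("Line is not wrapped in double quotes.");

        // Remove the outermost quotes, then split on the "," between fields.
        string inner = trimmed.Substring(1, trimmed.Length - 2);
        return inner.Split(new[] { "\",\"" }, StringSplitOptions.None);
    }
}
```

For row 6 of the question's test data this returns two fields, with the inner double quotes left in place. Note that it does not unescape doubled quotes, so it is only meant for lines the parser has already rejected.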
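Answer 1 (score: 1)
Jordan's solution is quite good, but it incorrectly assumes that the error line will always begin with a double quote. My error line was: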
170,"CMS ALT",853,,,NON_MOVEX,COM,NULL,"2014-04-25","" 204 Route de Trays"
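Note that the last field has an extra, unescaped double quote, but the first field is fine. Jordan's solution therefore doesn't work here. This is my modified solution, based on Jordan's: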
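using (TextFieldParser parser = new TextFieldParser(new StringReader(csv))) {
    parser.Delimiters = new[] { "," };
    while (!parser.EndOfData) {
        string[] fields = null;
        try {
            fields = parser.ReadFields();
        } catch (MalformedLineException) {
            // SafeTrim is the answerer's own helper; its definition is not shown.
            string errorLine = SafeTrim(parser.ErrorLine);
            // Plain split: wrapping quotes stay on the fields, and commas inside fields will split too.
            fields = errorLine.Split(',');
        }
    }
}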
You may want to handle the catch block differently, but the general concept works well for me.
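One way the catch block might be handled differently is to strip the wrapping quotes from each field after the plain split. The following is a rough sketch under stated assumptions, not the answerer's code: `LooseCsvLine` and `SplitAndStripQuotes` are hypothetical names, and the approach still assumes that no field value contains a comma.

```csharp
using System;

public static class LooseCsvLine
{
    // Hypothetical helper: splits on every comma, then removes one pair of
    // wrapping double quotes from each field that has them.
    public static string[] SplitAndStripQuotes(string line)
    {
        if (line == null) throw new ArgumentNullException(nameof(line));

        string[] fields = line.Split(',');
        for (int i = 0; i < fields.Length; i++)
        {
            string f = fields[i];
            if (f.Length >= 2 && f[0] == '"' && f[f.Length - 1] == '"')
            {
                // Keep whatever sits between the outer quotes, stray quotes included.
                fields[i] = f.Substring(1, f.Length - 2);
            }
        }
        return fields;
    }
}
```

Applied to the error line shown above, this yields ten fields; the last one keeps its stray leading quote, which would still need cleaning up if it matters for your data.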
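Answer 2 (score: 0)
It's probably easier to do this manually, and it certainly gives you more control:
Edit: For your clarified example, I'd still suggest handling the parsing manually: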
using System;
using System.IO;

string[] csvFile = File.ReadAllLines(pathToCsv);
foreach (string line in csvFile)
{
    // Find the first comma in the line:
    // everything before this index is the row number,
    // everything after it is the row value.
    int firstCommaIndex = line.IndexOf(',');
    // Note: Substring here takes (startIndex, length), so the comma itself is excluded.
    string row = line.Substring(0, firstCommaIndex);
    string rowValue = line.Substring(firstCommaIndex + 1).Trim();
    Console.WriteLine("This line was parsed as:\n{0},{1}",
        row, rowValue);
}
For a generic CSV in which commas are not allowed inside fields:
using System.IO;
string[] csvFile = File.ReadAllLines(pathToCsv);
foreach (string line in csvFile)
{
    string[] fields = line.Split(',');
    Console.WriteLine("This line was parsed as:\n{0},{1}",
        fields[0], fields[1]);
}
Answer 3 (score: 0)
Working solution:
using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
{
    csvReader.SetDelimiters(new string[] { "," });
    // Treat quotes as ordinary characters so unescaped quotes don't raise an error.
    csvReader.HasFieldsEnclosedInQuotes = false;
    // The first row is read separately as the column headers.
    string[] colFields = csvReader.ReadFields();
    while (!csvReader.EndOfData)
    {
        string[] fieldData = csvReader.ReadFields();
        for (int i = 0; i < fieldData.Length; i++)
        {
            if (fieldData[i] == "")
            {
                fieldData[i] = null;
            }
            else
            {
                // Strip one pair of wrapping double quotes, if present.
                if (fieldData[i][0] == '"' && fieldData[i][fieldData[i].Length - 1] == '"')
                {
                    fieldData[i] = fieldData[i].Substring(1, fieldData[i].Length - 2);
                }
            }
        }
        // csvData is declared elsewhere; its definition is not shown in this answer.
        csvData.Rows.Add(fieldData);
    }
}
Answer 4 (score: -1)
Before you start reading the file, set HasFieldsEnclosedInQuotes = true on the TextFieldParser object.