Handling fields that contain unescaped double quotes with TextFieldParser

Asked: 2013-04-25 22:40:07

Tags: c# parsing csv file-io

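I am trying to import a CSV file using TextFieldParser. One particular CSV file is causing me problems because of its non-standard formatting. The fields in that CSV are enclosed in double quotes, and the problem appears when a field also contains an unescaped pair of double quotes inside it.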

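Here is an oversimplified test case that highlights the problem. The actual CSV files I am dealing with are not all formatted the same way and have dozens of fields, any of which may contain these potentially tricky formatting issues.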

// Requires: using System; using System.IO; using Microsoft.VisualBasic.FileIO;
TextReader reader = new StringReader("\"Row\",\"Test String\"\n" +
    "\"1\",\"This is a test string.  It is parsed correctly.\"\n" +
    "\"2\",\"This is a test string with a comma,  which is parsed correctly\"\n" +
    "\"3\",\"This is a test string with double \"\"double quotes\"\". It is parsed correctly\"\n" +
    "\"4\",\"This is a test string with 'single quotes'. It is parsed correctly\"\n" +
    "5,This is a test string with fields that aren't enclosed in double quotes.  It is parsed correctly.\n" +
    "\"6\",\"This is a test string with single \"double quotes\".  It can't be parsed.\"");

using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        Console.WriteLine("This line was parsed as:\n{0},{1}",
            fields[0], fields[1]);
    }
}

Is there any way to correctly parse a CSV with this kind of formatting using TextFieldParser?

5 answers:

Answer 0 (score: 7)

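I agree with Hans Passant's advice that it is not your responsibility to parse malformed data. However, in keeping with the Robustness Principle, someone faced with this situation might still try to handle specific kinds of malformed data. The code I wrote below works for the data set specified in the question. Basically, it catches the parser error on the malformed line, checks the first character to decide whether the line is wrapped in double quotes, and then splits the line and strips all of the wrapping double quotes manually.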

using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };

    while (!parser.EndOfData)
    {
        string[] fields = null;
        try
        {
            fields = parser.ReadFields();
        }
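        catch (MalformedLineException)
        {
            // Fall back to a manual split only when the line is wrapped in double quotes.
            if (parser.ErrorLine.StartsWith("\""))
            {
                // Drop the outer quotes, then split on the "," sequence between fields.
                var line = parser.ErrorLine.Substring(1, parser.ErrorLine.Length - 2);
                fields = line.Split(new string[] { "\",\"" }, StringSplitOptions.None);
            }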
            else
            {
                throw;
            }
        }
        Console.WriteLine("This line was parsed as:\n{0},{1}", fields[0], fields[1]);
    }
}

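I am sure it is possible to concoct a pathological example where this fails (for instance, a comma adjacent to double quotes inside a field value), but any such example would probably be unparseable in the strictest sense, whereas the problem line in the question is still decipherable despite being malformed.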
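To make the limitation concrete, here is a minimal sketch that isolates the manual split from the fallback above and runs it on row 6 from the question and on one pathological line. The class name `QuotedLineFallback`, the helper `SplitQuotedLine`, and the "row 7" input are hypothetical names and data introduced only for illustration; they are not part of the answer's code.

```csharp
using System;

public static class QuotedLineFallback
{
    // Hypothetical helper: strips the outer quotes, then splits on the
    // three-character sequence "," that separates quote-wrapped fields.
    public static string[] SplitQuotedLine(string line)
    {
        if (line.Length < 2 || line[0] != '"' || line[line.Length - 1] != '"')
            throw new FormatException("Line is not wrapped in double quotes.");

        string inner = line.Substring(1, line.Length - 2);
        return inner.Split(new[] { "\",\"" }, StringSplitOptions.None);
    }

    public static void Main()
    {
        // Row 6 from the question: stray quotes inside the second field.
        string row6 = "\"6\",\"This is a test string with single \"double quotes\".  It can't be parsed.\"";
        string[] ok = SplitQuotedLine(row6);
        Console.WriteLine(ok.Length); // prints 2
        Console.WriteLine(ok[1]);

        // Invented pathological line: the field value itself contains the
        // sequence "," so the split cannot tell it from a field boundary.
        string bad = "\"7\",\"He said \"yes\",\"no\" and left\"";
        Console.WriteLine(SplitQuotedLine(bad).Length); // prints 3
    }
}
```

The second call returns three fields where a human reader would probably see two, which is the kind of ambiguity no splitting rule can resolve without extra knowledge about the data.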

Answer 1 (score: 1)

Jordan's solution is quite good, but it wrongly assumes that the error line will always start with a double quote. My error line was:

170,"CMS ALT",853,,,NON_MOVEX,COM,NULL,"2014-04-25",""  204 Route de Trays"

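Note that the last field has an extra, unescaped double quote, while the first field is fine. Jordan's solution therefore does not work here. This is my modified solution, based on Jordan's: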

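using (TextFieldParser parser = new TextFieldParser(new StringReader(csv)))
{
    parser.Delimiters = new[] { "," };

    while (!parser.EndOfData)
    {
        string[] fields = null;
        try
        {
            fields = parser.ReadFields();
        }
        catch (MalformedLineException)
        {
            // SafeTrim is the answerer's own helper; its definition is not shown.
            string errorLine = SafeTrim(parser.ErrorLine);
            // Plain comma split: quotes are not interpreted, so a comma inside a field also splits it.
            fields = errorLine.Split(',');
        }
        // Use "fields" here.
    }
}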

You may want to handle the catch block differently, but the general concept works well for me.
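As one possible way to handle that fallback, the sketch below splits the error line on commas and then strips any double quotes left at the ends of each field. The class `LooseSplit` and the method `SplitAndTrim` are hypothetical names introduced for illustration, and the approach assumes that no field value contains a comma; it is not the `SafeTrim` helper from the answer, whose definition is not shown.

```csharp
using System;
using System.Linq;

public static class LooseSplit
{
    // Hypothetical helper: split on every comma, then strip any double
    // quotes at the start and end of each field. Assumes field values
    // never contain a comma.
    public static string[] SplitAndTrim(string line)
    {
        return line.Split(',')
                   .Select(field => field.Trim('"'))
                   .ToArray();
    }

    public static void Main()
    {
        // The error line quoted in the answer above.
        string errorLine = "170,\"CMS ALT\",853,,,NON_MOVEX,COM,NULL,\"2014-04-25\",\"\"  204 Route de Trays\"";
        string[] fields = SplitAndTrim(errorLine);

        Console.WriteLine(fields.Length); // prints 10
        foreach (string field in fields)
            Console.WriteLine("[" + field + "]");
    }
}
```

Because `Trim('"')` removes every quote at either end, a field that legitimately starts or ends with a quote character would lose it, so this is only suitable when the wrapping quotes carry no meaning.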

Answer 2 (score: 0)

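It may be easier to do this manually, and it would certainly give you more control:

Edit: For your clarified example, I would still suggest handling the parsing manually: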

using System;
using System.IO;

string[] csvFile = File.ReadAllLines(pathToCsv);
foreach (string line in csvFile)
{
    // get the first comma in the line
    // everything before this index is the row number
    // everything after is the row value
    int firstCommaIndex = line.IndexOf(',');
    if (firstCommaIndex < 0) continue; // no comma on this line, nothing to split

    // Note: the Substring overload used here is (startIndex, length),
    // so a length of firstCommaIndex excludes the comma itself
    string row = line.Substring(0, firstCommaIndex);
    string rowValue = line.Substring(firstCommaIndex + 1).Trim();

    Console.WriteLine("This line was parsed as:\n{0},{1}",
            row, rowValue);
}

For a generic CSV in which commas are not allowed inside fields:

using System;
using System.IO;

string[] csvFile = File.ReadAllLines(pathToCsv);
foreach (string line in csvFile)
{
    string[] fields = line.Split(',');
    Console.WriteLine("This line was parsed as:\n{0},{1}",
            fields[0], fields[1]);
}

Answer 3 (score: 0)

Working solution:

// csvData is assumed to be a DataTable declared elsewhere; it is not shown in this answer.
using (TextFieldParser csvReader = new TextFieldParser(csv_file_path))
{
    csvReader.SetDelimiters(new string[] { "," });
    // Treat quotes as ordinary characters instead of field wrappers.
    csvReader.HasFieldsEnclosedInQuotes = false;
    // Read the first row separately (presumably the header).
    string[] colFields = csvReader.ReadFields();

    while (!csvReader.EndOfData)
    {
        string[] fieldData = csvReader.ReadFields();
        for (int i = 0; i < fieldData.Length; i++)
        {
            if (fieldData[i] == "")
            {
                fieldData[i] = null;
            }
            else if (fieldData[i].Length >= 2 &&
                     fieldData[i][0] == '"' &&
                     fieldData[i][fieldData[i].Length - 1] == '"')
            {
                // Strip the wrapping quotes by hand.
                fieldData[i] = fieldData[i].Substring(1, fieldData[i].Length - 2);
            }
        }
        csvData.Rows.Add(fieldData);
    }
}
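One trade-off of this approach is worth seeing in isolation: with HasFieldsEnclosedInQuotes set to false, the parser should no longer reject stray quotes, but it should also split on commas that sit inside a quoted field. The sketch below feeds rows 2 and 6 from the question through a parser configured that way. The class `RawFieldDemo` and the method `ReadRaw` are hypothetical names for illustration, and the project is assumed to reference the Microsoft.VisualBasic assembly.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class RawFieldDemo
{
    // Hypothetical helper: reads every row with quote handling disabled,
    // so each comma is treated as a delimiter and quotes stay in the data.
    public static List<string[]> ReadRaw(string csv)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(new StringReader(csv)))
        {
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = false;
            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }

    public static void Main()
    {
        string csv =
            "\"2\",\"This is a test string with a comma,  which is parsed correctly\"\n" +
            "\"6\",\"This is a test string with single \"double quotes\".  It can't be parsed.\"";

        foreach (string[] fields in ReadRaw(csv))
        {
            // Show how many fields each row produced and what they contain.
            Console.WriteLine(fields.Length + " field(s): " + string.Join(" | ", fields));
        }
    }
}
```

If the data can contain commas inside quoted fields, as row 2 does, the per-field quote stripping in the answer above is not enough on its own, because the field has already been cut at the embedded comma.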

Answer 4 (score: -1)

Before you start reading the file, set HasFieldsEnclosedInQuotes = true on the TextFieldParser object.