Question

I am reading CSV values line by line, and each line has the following format:

30: "NY", 41: "JOHN S.", 36: "HAMPTON", 42: "123 Road Street, NY", 68: "Y"

I need to break each line into the following pieces so the items can be parsed further:

30: "NY"

41: "JOHN S."

36: "HAMPTON"

42: "123 Road Street, NY" (note the comma)

...

I am using the FileHelper library, but it seems to want to split records line by line, whereas I want it to split on the comma delimiter.

I have this record class:

[DelimitedRecord(",")]
class BoxRecord
{
    public String record;
}

I use the following to retrieve what I hoped would be an array of several objects, but it only returns the original line:

DelimitedFileEngine engine = new DelimitedFileEngine(typeof(BoxRecord));
BoxRecord[] boxes = (BoxRecord[])engine.ReadString(boxLine);

I expect boxes[].record to contain:

30: "NY"

41: "JOHN S."

36: "HAMPTON"

42: "123 Road Street, NY"

...

What it actually contains:

30: "NY", 41: "JOHN S.", 36: "HAMPTON", 42: "123 Road Street, NY", 68: "Y"

Answer 1

Once you have the line, you can split it with LINQ as shown below to get what you want:

string input = "30: \"NY\", 41: \"JOHN S.\", " +
   "36: \"HAMPTON\", 42: \"123 Road Street, NY\", 68: \"Y\"";

var tempList = input.Split('\"').ToList();

var result = Enumerable.Range(0, tempList.Count/2)
    .Select(i => string.Join(": "
        , tempList[2*i].Split(new[] { ',', ':' })
           .Single(ss => !string.IsNullOrWhiteSpace(ss))

        , tempList[2*i + 1]));

Update: this was an interesting one for me. This code handles the case from your comment:

var tempList1 = input.Split(':').ToList();

var tempList2 = tempList1.SelectMany((s, index) =>
 {
     if (index == 0 || index == tempList1.Count - 1)
         return new List<string>() { s };

     var subList = s.Split(',');
     return new List<string>
           { 
                // rejoin with "," so a comma inside a value is not lost
                string.Join(",", subList.Take(subList.Length - 1)),
                subList.Last()
           };
 }).ToList();

var result = Enumerable.Range(0, tempList2.Count / 2)
         .Select(i => string.Join(": ", tempList2[2 * i], tempList2[2 * i + 1]));

Answer 2

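Technically, the sample you are looking at is not a validly formatted CSV file. Basically, whoever is providing the file is using the text qualifier, the double quote ("), in a non-standard way. The conventional usage is: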

123,"Sue said, ""Hi, this is a test!""",2012-08-15

This line should parse as follows:

Assert.AreEqual(line.Length, 3);
Assert.AreEqual(line[0], @"123");
Assert.AreEqual(line[1], @"Sue said, ""Hi, this is a test!""");
Assert.AreEqual(line[2], @"2012-08-15");

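Given the sample CSV in the question, and going by the standards I have seen, correct handling should basically treat the quotes as regular characters in the string rather than as text qualifiers. This is how I would interpret your line, but let me know if I am wrong!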

Assert.AreEqual(line.Length, 6);
Assert.AreEqual(line[0], @"30: ""NY""");
Assert.AreEqual(line[1], @" 41: ""JOHN S.""");
Assert.AreEqual(line[2], @" 36: ""HAMPTON""");
Assert.AreEqual(line[3], @" 42: ""123 Road Street");
Assert.AreEqual(line[4], @" NY""");
Assert.AreEqual(line[5], @" 68: ""Y""");
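
To see where those six pieces come from, here is a minimal sketch of what a parser does when it treats the quotes as ordinary characters and splits on every comma. The class and method names are hypothetical, purely for illustration; this is not how FileHelper itself is implemented.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Hypothetical helper: splits on every comma and treats
    // double quotes as ordinary characters, not text qualifiers.
    public static string[] SplitOnEveryComma(string line)
    {
        return line.Split(',');
    }
}
```

Calling SplitOnEveryComma on the sample line cuts the address in two at its embedded comma, which is the behaviour the asserts above describe.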

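I think FileHelper is breaking down because it cannot tell whether the text is properly qualified or properly delimited. You are definitely better off using custom code to handle this; the solution provided by Cuong Le looks like it should work for you.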

For reference, my C# CSV library is at: https://code.google.com/p/csharp-csv-reader/

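EDIT: Just for fun, I wondered whether this could be decoded with a regular expression. Your data is consistently formatted, even if it is not strictly CSV, so this may be another tool for your toolbox: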

String mystring = @"30: ""NY"", 41: ""JOHN S."", 36: ""HAMPTON"", 42: ""123 Road Street, NY"", 68: ""Y""
    20: ""STEVE"", 12: ""JONES"", 96: ""1600 PENNSYLVANIA AVE, NW""
    30: ""NY"", 41: ""JOHN S."", 36: ""HAMPTON"", 42: ""123 Road Street, NY"", 68: ""Y"", 40: 12345";
Regex r = new Regex(@"(?<id>\d*): (""(?<field>[^""]*)""|(?<field>[\d]*))");
MatchCollection mc = r.Matches(mystring);
foreach (Match m in mc) {
    Console.WriteLine("{0}: {1}", m.Groups["id"], m.Groups["field"]);
}

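Basically, the regex works by looking for a run of decimal digits followed by a colon and a space. It then matches either a double quote followed by all the text up to the next double quote, or a bare run of digits. In my tests, this also produced the correct matches for the two test lines you described in the question.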

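If my regex is not quite right, there is a nifty online regex tester here: http://gskinner.com/RegExr/ - try copying and pasting your data into the search area, then use this regex string as a starting point: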

(?<id>\d*): ("(?<field>[^"]*)"|(?<field>[\d]*))

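EDIT2: I fixed the regex so that it also handles the "40: 12345" value mentioned in the comments below. It now correctly detects all fields in all of the examples.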

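EDIT3: Following another request, this regex now supports numbers of unlimited length before the colon. Here is a quick explanation of how the regex works: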

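(?<id>\d*) - The first block is a capture group; capture groups are enclosed in parentheses. It tries to capture a repeated (*) string of decimal digits (\d) and names it "id" (?<id>).
: - Matches the colon and space that separate the id from the value.
"(?<field>[^"]*)" - Finds an opening quote, then any number of characters other than a quote ([^"]), ending with another quote. Saves the result in "field".
(?<field>[\d]*) - Finds any number of decimal digits and saves the result in "field". Note that some regex engines do not support two capture groups with the same name; you may need to call one "field1" and the other "field2".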
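
If you are on an engine that rejects duplicate group names, the same idea can be sketched with two distinct names. This is only an illustration under my own assumptions: the class name, the group names "quoted" and "bare", and the switch from \d* to \d+ (so that an empty id or empty bare value never matches) are my choices, not part of the regex above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BoxRegexSketch
{
    // Same shape as the pattern above, but the two value alternatives
    // use different group names ("quoted" and "bare" are hypothetical).
    private static readonly Regex Pattern =
        new Regex(@"(?<id>\d+): (?:""(?<quoted>[^""]*)""|(?<bare>\d+))");

    public static List<KeyValuePair<string, string>> Parse(string line)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(line))
        {
            // Exactly one of the two alternatives took part in the match.
            string value = m.Groups["quoted"].Success
                ? m.Groups["quoted"].Value
                : m.Groups["bare"].Value;
            pairs.Add(new KeyValuePair<string, string>(m.Groups["id"].Value, value));
        }
        return pairs;
    }
}
```

Checking Success on the "quoted" group, rather than testing for an empty string, keeps an empty quoted value ("") distinct from an unquoted number.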

Answer 3

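Despite all the "don't reinvent the wheel" posts I came across (and I came across all of them), this was the best solution for me, and the only one I could get to work.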

I tried the FileHelper framework, Cuong Le's answer, and the VB TextFieldParser. Each one failed in a different way.

I need to be able to parse this ("non-standard" CSV format). These lines are input from a file, but not a CSV file; they are part of a larger structure:

30: "NY", 41: "JOHN S.", 36: "HAMPTON", 42: "123 Road Street, NY", 68: "Y", 40: 12345

FileHelper would split on the comma inside the quotes, e.g.:

123 Road Street, NY

becomes

123 Road Street

NY

Cuong Le's answer did not handle this case: 40: 12345 (a data value without quotes)

TextFieldParser would also split on the comma inside the quotes, just like FileHelper.

My quick, dirty, do-it-yourself solution (and it works!):

    private List<KeyValuePair<string, string>> SplitBoxLine(String input)
    {
        //SAMPLE input:
        //30: "NY", 41: "JOHN S.", 36: "HAMPTON", 42: "123 Road Street, NY", 68: "Y", 40: 12345

        List<KeyValuePair<string, string>> boxes = new List<KeyValuePair<string, string>>();

        int quoteCount = 0;
        String buffer = "";
        String boxNum = "";
        String boxValue = "";

        for (int i = 0; i < input.Length; i++)
        {
            if (i == input.Length - 1)
            {
                //if the input character at the end ISN'T a quote or comma, add it to the buffer
                //supports the case where the last item is 40: 12345
                if (input[i] != ',' && input[i] != '\"')
                {
                    buffer += input[i];
                }
                boxValue = String.Copy(buffer.Trim());

                //once we have the value, we can create the pair
                KeyValuePair<string, string> pair = new KeyValuePair<string, string>(boxNum, boxValue);
                boxes.Add(pair);

                Console.WriteLine("BOX VALUE [LAST ITEM]: " + boxValue);
            }

            if (input[i] == ':')
            {
                boxNum = String.Copy(buffer.Trim());
                buffer = "";
                Console.WriteLine("BOX NUM: " + boxNum);
            }
            else if (input[i] == '\"')
            {
                quoteCount++;
            }
            else if (input[i] == ',')
            {
                if (quoteCount % 2 == 0) //comma occurs outside of quotes
                {
                    boxValue = String.Copy(buffer.Trim());
                    buffer = "";

                    //once we have the value, we can create the pair
                    KeyValuePair<string, string> pair = new KeyValuePair<string, string>(boxNum, boxValue);
                    boxes.Add(pair);

                    Console.WriteLine("BOX VALUE: " + boxValue);
                }
                else //the comma occurs in some quotes
                {
                    buffer += input[i]; //add the comma, it's just part of the boxValue
                }
            }
            //nothing special about this character, add it to the buffer and continue
            else
            {
                buffer += input[i];
            }
        }

        return boxes;
    }
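
For comparison, the same quote-tracking idea can be sketched a bit more compactly with a boolean flag and a StringBuilder. Treat this as a rough sketch rather than a drop-in replacement: the class and method names are hypothetical, and unlike the method above it assumes that a colon inside a quoted value should stay part of the value.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BoxLineSketch
{
    public static List<KeyValuePair<string, string>> Split(string input)
    {
        var boxes = new List<KeyValuePair<string, string>>();
        var buffer = new StringBuilder();
        string boxNum = null;
        bool inQuotes = false;

        foreach (char c in input)
        {
            if (c == '"')
            {
                inQuotes = !inQuotes; // quotes only toggle state; they are not kept
            }
            else if (c == ':' && !inQuotes && boxNum == null)
            {
                boxNum = buffer.ToString().Trim(); // text before the colon is the box number
                buffer.Clear();
            }
            else if (c == ',' && !inQuotes)
            {
                Flush(boxes, ref boxNum, buffer); // a comma outside quotes ends the pair
            }
            else
            {
                buffer.Append(c);
            }
        }
        Flush(boxes, ref boxNum, buffer); // the last pair has no trailing comma
        return boxes;
    }

    // Adds the pending pair (if a box number was seen) and resets the state.
    private static void Flush(List<KeyValuePair<string, string>> boxes,
                              ref string boxNum, StringBuilder buffer)
    {
        if (boxNum != null)
        {
            boxes.Add(new KeyValuePair<string, string>(boxNum, buffer.ToString().Trim()));
        }
        boxNum = null;
        buffer.Clear();
    }
}
```

Flushing once more after the loop is what lets an unquoted final item such as 40: 12345 through without special-casing the last character.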

Using the FileHelper library to parse CSV strings, but need to ignore newlines
