Question

I am trying to split CSV input using the following regular expression:

(?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)

Lines/rows with data such as ,a,b,c produce 3 matches:

,b
,c

I am losing/missing ,a and I can't figure out what needs to change.

It seems to work with the Python option: https://regex101.com/r/kW3pQ6/1

Any idea how to fix it for .NET?


Answer 1

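As others have suggested, you should use a class designed for parsing CSV strings. The TextFieldParser class (in the Microsoft.VisualBasic.FileIO namespace) is built into .NET. Unless you have other requirements not mentioned in your question, there is no need for an external library.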

using System.IO;
using Microsoft.VisualBasic.FileIO; // TextFieldParser lives in the Microsoft.VisualBasic assembly

// s is the string that holds the CSV text to parse
using(MemoryStream stream = new MemoryStream())
using(StreamWriter writer = new StreamWriter(stream))
{
    writer.Write(s);
    writer.Flush();
    stream.Position = 0; // Rewind so the parser reads from the start

    using(TextFieldParser parser = new TextFieldParser(stream)){
        parser.TextFieldType = FieldType.Delimited;
        parser.Delimiters = new string[] {","};
        parser.HasFieldsEnclosedInQuotes = true;

        while(!parser.EndOfData){ // Loop through lines until we reach the end of the data
            string[] fields = parser.ReadFields(); // Contains the fields of the current line
        }
    }
}

https://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser%28v=vs.110%29.aspx

Answer 2

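Why not use a CSV NuGet package? It accounts for many of the nuances of CSV parsing that you are trying to solve now, as well as others you don't yet know you will need to solve :-)

CsvHelper is a very popular open-source package: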
https://www.nuget.org/packages/CsvHelper
https://github.com/JoshClose/CsvHelper

Answer 3

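Yes, I know regex is not the "right" answer, but it is what the question asked for, and I like a good regex challenge.

Note: Although the solution below could likely be adapted for other regex engines, using it as-is requires that your regex engine treat multiple named capture groups using the same name as one single capture group. (.NET does this by default.)

When multiple lines/records of a CSV file/stream (matching RFC standard 4180) are passed to the regular expression below, it returns a match for each non-empty line/record. Each match contains a capture group named Value that holds the values captured in that line/record (and potentially an OpenValue capture group if there was an open quote at the end of the line/record).

Here is the commented pattern (test it on Regexstorm.net):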

(?<=\r|\n|^)(?!\r|\n|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>(?!")[^,\r\n]+)|                 // Unquoted value -or-
    "(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|   // Open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  (?:,|(?=\r|\n|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:\r\n|\r|\n|$)                              // Records end at EOL

Here is the raw pattern without all the comments or whitespace.

(?<=\r|\n|^)(?!\r|\n|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,\r\n]+)|"(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)

For examples of how to use the regex pattern, see my answer to a similar question here, or on C# Pad here or here.
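
As a rough illustration of reading the results, here is a minimal sketch of how the raw pattern above might be used from C#. The class name CsvRegexSketch and the method Parse are hypothetical names chosen for this example. It relies on the .NET behavior described in the note above: every (?<Value>...) group shares one name, so the fields of a record can be read, in order, from that group's Captures collection. The sketch ignores the OpenValue group and does not collapse doubled quotes ("") inside quoted values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvRegexSketch
{
    // The raw pattern from this answer as a C# verbatim string
    // (every " of the pattern is doubled to escape it).
    private static readonly Regex RecordPattern = new Regex(
        @"(?<=\r|\n|^)(?!\r|\n|$)(?:(?:""(?<Value>(?:[^""]|"""")*)""|(?<Value>(?!"")[^,\r\n]+)|""(?<OpenValue>(?:[^""]|"""")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)");

    // Hypothetical helper: returns one string[] per non-empty record.
    public static List<string[]> Parse(string csv)
    {
        var records = new List<string[]>();
        foreach (Match record in RecordPattern.Matches(csv))
        {
            // All same-named groups feed a single capture list in .NET.
            CaptureCollection captures = record.Groups["Value"].Captures;
            var fields = new string[captures.Count];
            for (int i = 0; i < captures.Count; i++)
            {
                fields[i] = captures[i].Value;
            }
            records.Add(fields);
        }
        return records;
    }

    public static void Main()
    {
        // Two records: the line from the question, and one with a quoted comma.
        foreach (string[] fields in Parse(",a,b,c\r\nx,\"y,z\","))
        {
            Console.WriteLine(string.Join("|", fields));
        }
    }
}
```

Here the line from the question, ,a,b,c, should come back as four fields, the first of them empty. For anything beyond experimentation, a dedicated CSV parser remains the safer choice.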

CSV Regex Split missing column
