描述

Question

我知道这个（或类似的）已被多次询问，但尝试了很多可能性，但我找不到100％正常工作的正则表达式。

我有一个CSV文件，我正在尝试将其拆分为数组，但遇到两个问题：引用逗号和空元素。

CSV看起来像：

123,2.99,AMO024,Title,"Description, more info",,123987564

我试图使用的正则表达式是：

thisLine.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)

唯一的问题是在我的输出数组中，第5个元素是123987564而不是空字符串。

Answer 1

描述

除了使用拆分之外，我认为简单地执行匹配并处理所有找到的匹配会更容易。

此表达式将：

将您的示例文本划分为逗号分隔
将处理空值
将忽略双引号，提供双引号不嵌套
修剪返回值
修剪返回值

正则表达式：(?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)

enter image description here

实施例

示例文字

123,2.99,AMO024,Title,"Description, more info",,123987564

使用非java表达式的ASP示例

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.MultiLine = True
sourcestring = "your source string"
regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
Set Matches = regEx.Execute(sourcestring)
  For z = 0 to Matches.Count-1
    results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
    For zz = 0 to Matches(z).SubMatches.Count-1
      results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
    next
    results=Left(results,Len(results)-1) & chr(13)
  next
Response.Write "<pre>" & results

使用非java表达式匹配

组0获取包含逗号
的整个子字符串如果使用了第1组，则获得报价第2组获得的值不包括逗号

[0][0] = 123
[0][1] = 
[0][2] = 123

[1][0] = ,2.99
[1][1] = 
[1][2] = 2.99

[2][0] = ,AMO024
[2][1] = 
[2][2] = AMO024

[3][0] = ,Title
[3][1] = 
[3][2] = Title

[4][0] = ,"Description, more info"
[4][1] = "
[4][2] = Description, more info

[5][0] = ,
[5][1] = 
[5][2] = 

[6][0] = ,123987564
[6][1] = 
[6][2] = 123987564

Answer 2

稍微研究了一下这个解决方案：

(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))

Try it out here!

此解决方案处理“漂亮”的CSV数据，如

"a","b",c,"d",e,f,,"g"

0: "a"
1: "b"
2: c
3: "d"
4: e
5: f
6:
7: "g"

和更加丑陋的事情如

"""test"" one",test' two,"""test"" 'three'","""test 'four'"""

0: """test"" one"
1: test' two
2: """test"" 'three'"
3: """test 'four'"""

这是explanation of how it works：

(?:,|\n|^)      # all values must start at the beginning of the file,  
                #   the end of the previous line, or at a comma  
(               # single capture group for ease of use; CSV can be either...  
  "             # ...(A) a double quoted string, beginning with a double quote (")  
    (?:         #        character, containing any number (0+) of  
      (?:"")*   #          escaped double quotes (""), or  
      [^"]*     #          non-double quote characters  
    )*          #        in any order and any number of times  
  "             #        and ending with a double quote character  

  |             # ...or (B) a non-quoted value  

  [^",\n]*      # containing any number of characters which are not  
                # double quotes ("), commas (,), or newlines (\n)  

  |             # ...or (C) a single newline or end-of-file character,  
                #           used to capture empty values at the end of  
  (?:\n|$)      #           the file or at the ends of lines  
)

Answer 3

我也需要这个答案，但我找到了答案，虽然提供了信息，但有点难以理解和复制其他语言。这是我为CSV行中的单个列提出的最简单的表达式。我没有分裂。我正在构建一个正则表达式以匹配CSV中的列，所以我不会拆分该行：

("([^"]*)"|[^,]*)(,|$)

这匹配CSV行中的单个列。表达式的第一部分"([^"]*)"用于匹配带引号的条目，第二部分[^,]*用于匹配未引用的条目。然后是,或行尾$。

随附的debuggex来测试表达式。

https://www.debuggex.com/r/s4z_Qi2gZiyzpAhx

Answer 4

我迟到了，但以下是我使用的正则表达式：

(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

此模式有三个捕获组：

引用单元格的内容
未加引号的单元格的内容
新行

此模式处理以下所有内容：

没有任何特殊功能的正常单元格内容： 1,2,3
包含双引号的单元格（“已转义为”“）：没有引用，”a“”引用“”的事情“，结束
单元格包含换行符：一个，两个\ nthree，四个
具有内部引用的正常单元格内容：一，二“三，四
单元格包含引号后跟逗号：一个，“两个”“三个”“，四个”，五个

See this pattern in use.

如果您正在使用具有命名组和外观的更强大的正则表达式，我更喜欢以下内容：

(?<quoted>(?<=,"|^")(?:""|[\w\W]*?)*(?=",|"$))|(?<normal>(?<=,(?!")|^(?!"))[^,]*?(?=(?<!")$|(?<!"),))|(?<eol>\r\n|\n)

See this pattern in use.

修改

(?:^"|,")(""|[\w\W]*?)(?=",|"$)|(?:^(?!")|,(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

这个略微修改的模式处理第一列为空的行，只要您不使用Javascript即可。出于某种原因，Javascript将省略具有此模式的第二列。我无法正确处理这种边缘情况。

Answer 5

将JScript用于经典ASP页面的优点是，您可以使用为JavaScript编写的众多库中的一个。

像这样：https://github.com/gkindel/CSV-JS。下载它，将其包含在您的ASP页面中，用它解析CSV。

<%@ language="javascript" %>

<script language="javascript" runat="server" src="scripts/csv.js"></script>
<script language="javascript" runat="server">

var text = '123,2.99,AMO024,Title,"Description, more info",,123987564',
    rows = CSV.parse(line);

    Response.Write(rows[0][4]);
</script>

Answer 6

我亲自尝试了很多RegEx表达式而没有找到适合所有情况的完美表达。

我认为正则表达式很难正确配置以正确匹配所有情况。虽然很少有人不喜欢命名空间（我也是其中的一部分），但我提出的内容是.Net框架的一部分，并且在所有情况下始终给我正确的结果（主要是很好地管理每个双引号的情况）：

Microsoft.VisualBasic.FileIO.TextFieldParser

在此处找到：StackOverflow

使用示例：

TextReader textReader = new StringReader(simBaseCaseScenario.GetSimStudy().Study.FilesToDeleteWhenComplete);
Microsoft.VisualBasic.FileIO.TextFieldParser textFieldParser = new TextFieldParser(textReader);
textFieldParser.SetDelimiters(new string[] { ";" });
string[] fields = textFieldParser.ReadFields();
foreach (string path in fields)
{
    ...

希望它可以提供帮助。

Answer 7

在Java中，这种模式",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))" 几乎对我有用：

String text = "\",\",\",,\",,\",asdasd a,sd s,ds ds,dasda,sds,ds,\"";
String regex = ",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))";
Pattern p = Pattern.compile(regex);
String[] split = p.split(text);
for(String s:split) {
    System.out.println(s);
}

输出：

","
",a,,"

",asdasd a,sd s,ds ds,dasda,sds,ds,"

缺点：当列具有奇数引号时不起作用：（

Answer 8

Aaaand这里的另一个答案。 :)因为我无法使其他人完全工作。

我的解决方案都处理转义引号（双重引号），并且匹配中不包含分隔符。

请注意，我一直在与'而不是"进行匹配，因为这是我的方案，但只是在模式中替换它们以获得相同的效果。

此处（如果您使用下面的注释版本，请记得使用“忽略空白”标记/x）：

# Only include if previous char was start of string or delimiter
(?<=^|,)
(?:
  # 1st option: empty quoted string (,'',)
  '{2}
  |
  # 2nd option: nothing (,,)
  (?:)
  |
  # 3rd option: all but quoted strings (,123,)
  # (included linebreaks to allow multiline matching)
  [^,'\r\n]+
  |
  # 4th option: quoted strings (,'123''321',)
  # start pling
  ' 
    (?:
      # double quote
      '{2}
      |
      # or anything but quotes
      [^']+
    # at least one occurance - greedy
    )+
  # end pling
  '
)
# Only include if next char is delimiter or end of string
(?=,|$)

单行版本：

(?<=^|,)(?:'{2}|(?:)|[^,'\r\n]+|'(?:'{2}|[^']+)+')(?=,|$)

Regular expression visualization (if it works, debux has issues right now it seems - else follow the next link)

Debuggex Demo

regex101 example

Answer 9

我正在使用这个，它适用于昏迷分隔符和双引号转义。通常这应该解决你的问题：

/(?<=^|,)(\"(?:[^"]+|"")*\"|[^,]*)(?:$|,)/g

Answer 10

另一个答案还有一些额外的功能，例如支持包含转义引号和CR / LF字符的引用值（单个值跨越多行）。

注意：虽然以下解决方案可能适用于其他正则表达式引擎，但使用 as-is 将要求您的正则表达式引擎将multiple named capture groups using the same name视为一个单一的捕获组。（.NET默认执行此操作）

当CSV文件/流的多行/记录（匹配RFC standard 4180）传递给下面的正则表达式时，它将返回每个非空行/记录的匹配项。每个匹配项都将包含一个名为Value的捕获组，其中包含该行/记录中捕获的值（如果在行尾有一个打开的引号，则可能包含OpenValue捕获组记录）

这是注释模式（测试它on Regexstorm.net）：

(?<=\r|\n|^)(?!\r|\n|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>(?!")[^,\r\n]+)|                 // Unquoted value -or-
    "(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|   // Open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  (?:,|(?=\r|\n|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:\r\n|\r|\n|$)                              // Records end at EOL

这是没有所有注释或空格的原始模式。

(?<=\r|\n|^)(?!\r|\n|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,\r\n]+)|"(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)

Here is a visualization from Debuggex.com（为清晰起见而命名的捕获组）： Debuggex.com visualization

有关如何使用正则表达式模式的示例，请参阅我对类似问题here或C# pad here或here的回答。

Answer 11

我使用这个表达。它考虑了我遇到的逗号后的空格。

(?:,"|^"|, ")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

Answer 12

这个匹配c＃所需的全部内容：

(?<=(^|,)(?<quote>"?))([^"]|(""))*?(?=\<quote>(?=,|$))

剥离报价
让新线
在引用的字符串
在引用的字符串

Answer 13

,?\s*'.+?'|,?\s*".+?"|[^"']+?(?=,)|[^"']+

这个正则表达式适用于单引号和双引号，也适用于另一个引用！

Answer 14

如果你知道你没有空字段（,,）那么这个表达式效果很好：

Route::delete('/project/{project}', function (Project $project) {
    DB::table('pics')->where('projectId', '=', $project->id)->delete();
    $project -> delete();
    return redirect('/');
});

如下例所示......

("[^"]*"|[^,]+)

但是，如果您预计空字段并且文本相对较小，则可能会考虑在解析之前用空格替换空字段以确保捕获它们。例如......

Set rx = new RegExp
rx.Pattern = "(""[^""]*""|[^,]+)"
rx.Global = True
Set col = rx.Execute(sText)
For n = 0 to col.Count - 1
    if n > 0 Then s = s & vbCrLf
    s = s & col(n)
Next

如果您需要维护字段的完整性，可以恢复逗号并测试循环内的空白区域。这可能不是最有效的方法，但它可以完成工作。

Answer 15

如果我在http://regex101.com上使用＆＃39; g＆＃39;尝试@chubbsondubs发布的正则表达式flag，有匹配，只包含＆＃39;，＆＃39;或一个空字符串。使用此正则表达式：
(?:"([^"]*)"|([^,]*))(?:[,])我可以匹配CSV的各个部分（包含引用的部分）。（该行必须以＆＃39;终止，否则最后一部分不会被识别。）
https://regex101.com/r/dF9kQ8/4
如果CSV看起来像：
"",huhu,"hel lo",world,
有4场比赛：
＆＃39;＆＃39;
＆＃39;乎乎＆＃39;
＆＃39; hel lo＆＃39;
＆＃39;世界＆＃39;

Answer 16

我有类似的需要从SQL插入语句中拆分CSV值。

在我的情况下，我可以假设字符串用单引号括起来而且数字不是。

csv.split(/,((?=')|(?=\d))/g).filter(function(x) { return x !== '';});

由于一些可能明显的原因，这个正则表达式会产生一些空白结果。我可以忽略这些，因为我的数据中的任何空值都表示为...,'',...而不是...,,...。

Answer 17

正确的正则表达式匹配单个带引号的值与其中的转义[doubled]单引号：

'([^n']|(''))+'

正则表达式拆分CSV

17 个答案:

描述

实施例