Question

I'm using this regular expression to parse CSV lines in Apex:

Pattern csvPattern = Pattern.compile('(?:^|,)(?:\"([^\"]+|\"\")*\"|([^,]+)*)');

It works well, but every match returns two groups (one for quoted values and one for unquoted values). See below:

Matcher csvMatcher = csvPattern.matcher('"hello",world');
Integer m = 1;
while (csvMatcher.find()) {
    System.debug('Match ' + m);
    for (Integer i = 1; i <= csvMatcher.groupCount(); i++) {
        System.debug('Capture group ' + i + ': ' + csvMatcher.group(i));
    }
    m++;
}

Running this code returns the following:

[5]|DEBUG|Match 1
[7]|DEBUG|Capture group 1: hello
[7]|DEBUG|Capture group 2: null
[5]|DEBUG|Match 2
[7]|DEBUG|Capture group 1: null
[7]|DEBUG|Capture group 2: world

I'd like each match to return only the non-null capture. Is that possible?

Answer 1

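This is actually a hard thing to do. It can be done with lookahead/lookbehind assertions, although it is not very intuitive.

It would look something like this: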
(?:^|,)(\s*"(?=(?:[^"]+|"")*"\s*(?:,|$)))?((?<=")(?:[^"]+|"")*(?="\s*(?:,|$))|[^,]*)

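It works by lining up the match position just after the opening quote " of a valid quoted field. If the field is not a valid quoted field, the position stays on the quote itself. At that point the field body can be captured in a single capture group, either as an unquoted field or as a quoted field minus its quotes.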

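This may be a working regex that gives an exact solution with no residual code needed. I may be missing something, but without lookaround assertions I see no way to do this, so your engine must support them. If it doesn't, you will have to select the group in code, as the other solutions do.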
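For the fallback case, where the group is selected in code rather than by the regex, a minimal sketch might look like the following. It is written in Java on the assumption that Apex's Pattern and Matcher behave like java.util.regex for this pattern. The class and method names (NonNullGroupPicker, fields) are hypothetical and not part of the original thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps the question's two-group pattern and picks
// whichever group took part in each match.
final class NonNullGroupPicker {
    // The pattern from the question, written as a Java string literal.
    private static final Pattern CSV =
        Pattern.compile("(?:^|,)(?:\"([^\"]+|\"\")*\"|([^,]+)*)");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = CSV.matcher(line);
        while (m.find()) {
            // Group 1 is set for quoted values, group 2 for unquoted ones.
            String value = m.group(1) != null ? m.group(1) : m.group(2);
            // Both groups are null for an empty field; report it as "".
            out.add(value == null ? "" : value);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("\"hello\",world")); // prints [hello, world]
    }
}
```

This keeps the regex simple at the cost of one conditional per match, which is the "residual code" the lookaround version tries to avoid.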

Here is a prototype in Perl, followed by the same regex expanded with comments. Good luck!

$samp = '  "hello " , world",,me,and,th""is, or , "tha""t"  ';

$regex = '
  (?: ^ | , )
  (\s*" (?= (?:[^"]+|"")* " \s*(?:,|$) ) )?
  (
     (?<=") (?:[^"]+|"")* (?="\s*(?:,|$) )
   |
     [^,]*
  )
';
while ($samp =~ /$regex/xg)
{
   print "'$2'\n";
}

Output

'hello '
' world"'
''
'me'
'and'
'th""is'
' or '
'tha""t'

(?: ^ | , )          # Consume comma (or BOL is fine)

(                    # Capture group 1, capture '"' only if a complete quoted field
   \s*                  # Optional many spaces
   "
   (?=                  # Lookahead, check for a valid quoted field, determines if a '"' will be consumed
      (?:[^"]+|"")*
      "
      \s*
      (?:,|$)
   )
)?                   # End capt grp 1. 0 or 1 quote

(                    # Capture group 2, the body of text
   (?<=")                 # If there is a '"' behind us, we have consumed a '"' in capture grp 1, so this is valid
   (?:[^"]+|"")*
   (?="\s*(?:,|$) )
 |                      # OR,
   [^,]*                  # Just get up to the next ',' This could be incomplete quoted fields
)                    # End capt grp 2
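Since Apex regex classes are modeled on Java's, the Perl prototype can be sketched in Java as follows. This is an illustrative port under that assumption, not tested on Apex itself. The class and method names (LookaroundCsv, fields) are hypothetical. The field body is always read from capture group 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical port of the Perl prototype to java.util.regex.
final class LookaroundCsv {
    private static final Pattern FIELD = Pattern.compile(
        "(?:^|,)"                                       // comma or start of input
        + "(\\s*\"(?=(?:[^\"]+|\"\")*\"\\s*(?:,|$)))?"  // group 1: opening quote, only for a valid quoted field
        + "((?<=\")(?:[^\"]+|\"\")*(?=\"\\s*(?:,|$))"   // group 2: body of a quoted field, or
        + "|[^,]*)");                                   //          everything up to the next comma

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(2)); // group 2 always participates in a match
        }
        return out;
    }

    public static void main(String[] args) {
        // Same sample string as the Perl prototype.
        String samp = "  \"hello \" , world\",,me,and,th\"\"is, or , \"tha\"\"t\"  ";
        for (String f : fields(samp)) {
            System.out.println("'" + f + "'");
        }
    }
}
```

On the sample string this should produce the same eight fields as the Perl output above. Doubled quotes inside a quoted field are left as they are, exactly as in the prototype.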

Extension

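If your engine supports it, you can speed up quoted fields by using a backreference instead of matching the quoted field twice. A backreference usually resolves to a single string comparison, such as strncmp() in C, which makes it faster. As a side note, whitespace before and after the body of a non-quoted field can be trimmed with a little extra notation in the regex.
Good luck!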

Compressed

(?:^|,)(?:\s*"(?=((?:[^"]+|"")*)"\s*(?:,|$)))?((?<=")\1|[^,]*)

Expanded

(?: ^|, )
(?: \s* " (?=  ( (?:[^"]+|"")* )  " \s*  (?: ,|$ )  ))?
( (?<=") \1 | [^,]* )

Expanded with comments

(?: ^ | , )          # Consume comma (or BOL is fine)

(?:                  # Start grouping
   \s*                  # Spaces, then double quote '"' (consumed if valid quoted field)
   "                    #
   (?=                  # Lookahead, nothing consumed (check for valid quoted field)
      (                     # Capture grp 1
         (?:[^"]+|"")*          # Body of quoted field  (stored for later consumption)
      )                     # End capt grp 1
      "                     # Double quote '"'
      \s*                   # Optional spaces
      (?: , | $ )           # Comma or EOL
   )                    # End lookahead
)?                   # End grouping, optionally matches and consumes '\s*"'

(                    # Capture group 2, consume FIELD BODY
   (?<=")                 # Lookbehind, if there is a '"' behind us the field is quoted
   \1                     # Consume capt grp 1
 |                      # OR,
   [^,]*                  # Invalid-quoted or Non-quoted field, get up to the next ','
)                    # End capt grp 2
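The backreference variant can be sketched in Java in the same way. It relies on java.util.regex keeping a group captured inside a positive lookahead, so that \1 can consume it afterwards. The class and method names (BackrefCsv, fields) are hypothetical, and whether Apex behaves identically is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical port of the backreference variant to java.util.regex.
final class BackrefCsv {
    private static final Pattern FIELD = Pattern.compile(
        "(?:^|,)"                                            // comma or start of input
        + "(?:\\s*\"(?=((?:[^\"]+|\"\")*)\"\\s*(?:,|$)))?"   // consume the opening quote; group 1 stores the body
        + "((?<=\")\\1|[^,]*)");                             // group 2: replay group 1, or take up to the next comma

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(2)); // the field body is always in group 2
        }
        return out;
    }

    public static void main(String[] args) {
        String samp = "  \"hello \" , world\",,me,and,th\"\"is, or , \"tha\"\"t\"  ";
        for (String f : fields(samp)) {
            System.out.println("'" + f + "'");
        }
    }
}
```

The quoted body is scanned once inside the lookahead and then consumed by a plain string comparison, which is the speed-up described above. For unquoted fields group 1 is unset and the second alternative applies.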

Answer 2

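Taking some inspiration from ruakh, I've updated the regex to return only one capture group per match (and to handle quotes in fields and whitespace).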

(?:^|[\s]*?,[\s]*)(\"(?:(?:[^\"]+|\"\")*)[^,]*|(?:[^,])*)

How do I return only the non-null capture group for each match?
