Question

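I need to split a string at commas, but only at commas that do not occur inside quoted substrings. My approach is to:

replace the commas inside quotes with some special token,
split the string at commas, then
replace occurrences of the token with a comma (in the split strings).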

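I realize there is probably a simpler way to do this, but for now I am only interested in why the named-group replacement does not work, as described below.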

I have a regex that identifies a comma inside a quoted substring as the named capture commahere:

COMMA_INSIDE_QUOTES_REGEX = /
  (?<quote>[\"\'])      # start by finding either single or double quote
  (?<postquote>.*?)     # then lazy capture any other chars until...
  (?<commahere>\,)      # ...we find the comma
  (?<postcomma>.*?)     # then lazy capture any other chars until...
  (\k<quote>)           # ...we find the matching single or double quote
/x

On the following test string, the regex matches inside de,f and jk,a,l, and not at the other commas, as I would expect.

str = 'abc,"de,f",ghi,"jk,a,l"'
COMMA_INSIDE_QUOTES_REGEX.match(str)
#=> #<MatchData "\"de,f\"" quote:"\"" postquote:"de" commahere:"," postcomma:"f">

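But when I use gsub to replace the named capture with the special token, the whole match is replaced rather than just the named group (and I end up with two commas as well!):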

COMMA_TOKEN = '<--COMMA-->'
str.gsub(COMMA_INSIDE_QUOTES_REGEX,"\\k<commahere>#{COMMA_TOKEN}")
#=> "abc,,<--COMMA-->,ghi,,<--COMMA-->"

Answer 1

You have misunderstood something.

str.gsub(COMMA_INSIDE_QUOTES_REGEX,"\\k<commahere>#{COMMA_TOKEN}")

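means:

Try to match the regex COMMA_INSIDE_QUOTES_REGEX in the string str.
If that succeeds, replace the entire match with a string built from the contents of <commahere> followed by the contents of COMMA_TOKEN.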

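It does not mean "replace only the group <commahere> with whatever follows it". Your approach is wrong: what you are trying to do cannot be done the way you are trying to do it. You really should take mu's advice and use a CSV parser.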
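As a rough illustration of the CSV-parser suggestion, here is a minimal sketch using CSV.parse_line from Ruby's csv library. The helper name split_quoted_csv is made up for this example, and the sketch assumes the fields are quoted with double quotes only, which is the parser's default quote character.

```ruby
require 'csv'

# Hypothetical helper name, used only for this illustration.
# CSV.parse_line splits one line at commas that are outside double quotes.
def split_quoted_csv(line)
  CSV.parse_line(line)
end

fields = split_quoted_csv('abc,"de,f",ghi,"jk,a,l"')
# fields is ["abc", "de,f", "ghi", "jk,a,l"]
```

Note that the parser strips the surrounding quotes from each field, which the token-based approach in the question would not do.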

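If you really do want to use a regex, it has to be constructed like this:

Match a comma.
Check whether this comma is inside a string. This can be done by counting the number of quotes that follow the comma. If that number is odd, the comma is inside a string.
This trick still works when quotes are embedded in the string itself, because such quotes are escaped by doubling them, so they always come in pairs.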

So, here is your regex:

result = str.gsub(
    /,        # Match a comma
    (?!       # only if it's not followed by
     (?:      # the following group:
      [^"]*"  #  any number of non-quote characters and a quote
      [^"]*"  #  twice (so exactly two quotes are matched)
     )*       # any number of times (including 0)
     [^"]*    # followed (if at all) by only non-quote characters
     \Z       # until the end of the string.
    )         # End of lookahead
    /x, '<--COMMA-->')
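To see how this lookahead fits into the three-step plan from the question, here is a possible sketch. The helper name split_outside_quotes is hypothetical, and the sketch assumes that only double quotes delimit strings and that the token never occurs in the input itself.

```ruby
# Hypothetical helper, for illustration only.
# Assumes only double quotes delimit strings and the token is not in the input.
def split_outside_quotes(str, token = '<--COMMA-->')
  # A comma followed by an odd number of quotes is inside a quoted string.
  inside_quotes = /,(?!(?:[^"]*"[^"]*")*[^"]*\Z)/

  str.gsub(inside_quotes, token)            # 1. mask commas inside quotes
     .split(',', -1)                        # 2. split at the remaining commas (-1 keeps trailing empty fields)
     .map { |field| field.gsub(token, ',') } # 3. restore the masked commas
end

parts = split_outside_quotes('abc,"de,f",ghi,"jk,a,l"')
# parts is ["abc", "\"de,f\"", "ghi", "\"jk,a,l\""]
```

Unlike a CSV parser, this keeps the quote characters in each field.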

Answer 2

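That is how gsub works. gsub replaces the entire match with the replacement string. Otherwise, how would gsub know which substring of the whole match to replace? Where would that information come from?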
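Because the entire match is replaced, one way to change only the comma is to rebuild the rest of the match in the replacement. The sketch below reuses the question's named groups; the method names are invented for this example. With this regex only the first comma in each quoted substring is captured, so the second comma in jk,a,l stays as it is.

```ruby
# Hypothetical helper names, for illustration only.
QUOTED_COMMA = /
  (?<quote>["'])
  (?<postquote>.*?)
  (?<commahere>,)
  (?<postcomma>.*?)
  \k<quote>
/x

# String form: every part of the match that should survive is written back
# with \k<name>, and only the comma is swapped for the token.
def mask_with_backrefs(str, token = '<--COMMA-->')
  str.gsub(QUOTED_COMMA, "\\k<quote>\\k<postquote>#{token}\\k<postcomma>\\k<quote>")
end

# Block form: the same rebuild, reading the groups from the MatchData.
def mask_with_block(str, token = '<--COMMA-->')
  str.gsub(QUOTED_COMMA) do
    m = Regexp.last_match
    "#{m[:quote]}#{m[:postquote]}#{token}#{m[:postcomma]}#{m[:quote]}"
  end
end

masked = mask_with_backrefs('abc,"de,f",ghi,"jk,a,l"')
# masked is "abc,\"de<--COMMA-->f\",ghi,\"jk<--COMMA-->a,l\""
```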

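To keep a substring out of the replaced part, you have to use lookbehind, negative lookbehind, lookahead, or negative lookahead, depending on your needs. However, lookbehind does not allow variable-length strings, so you can use lookbehind or lookahead for quote and postcomma, but you will have to reproduce the postquote part in the replacement string.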

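Your regex has several other problems. Constant substrings such as " and , are easy to refer to directly, so there is no point in capturing them under names like quote or commahere. You also seem not to understand how replacement strings are constructed. If you want the comma to be replaced by the substitution string, you should not have \k<commahere> in the substitution string.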

Ruby gsub doesn't respect named groups when substituting with a regexp
