Question

I need to do some simple parsing with embedded single and double quotes. For the following input:

" hello    'there   ok \"hohh\"   '   ciao    \"eeee  \"   \"  yessss 'aaa'  \"   %%55+ "

I need the following output:

["hello", "there   ok \"hohh\"   ", "ciao", "eeee  ", "  yessss 'aaa'  ", "%%55+"]

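Why does the following Ruby code I came up with work? I don't understand the regex part. I know basic regex, and I would have expected the embedded quotes not to work, yet they do, both single quotes inside double quotes and vice versa.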

text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}

Answer 1

There is no need to solve this with a custom regex; the Ruby standard library contains a module for this: Shellwords

Manipulates strings like the UNIX Bourne shell

This module manipulates strings according to the word parsing rules of the UNIX Bourne shell.

Usage:

require 'shellwords'

str = " hello    'there   ok \"hohh\"   '   ciao    \"eeee  \"   \"  yessss 'aaa'  \"   %%55+ "

Shellwords.split(str)
  #=> ["hello", "there   ok \"hohh\"   ", "ciao", "eeee  ", "  yessss 'aaa'  ", "%%55+"]
# Or equivalently:
str.shellsplit
  #=> ["hello", "there   ok \"hohh\"   ", "ciao", "eeee  ", "  yessss 'aaa'  ", "%%55+"]

The above is the "right" answer. Use that. What follows is additional information explaining why to use it, and why your answer "sort of" works.

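Parsing these strings accurately is tricky! Your regex attempt works for most inputs, but it does not handle various edge cases properly. For example, consider: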

str = "foo\\ bar"

str.shellsplit
  #=> ["foo bar"] (correct!)

str.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.select{|x|x}
  #=> ["foo\\", "bar"] (wrong!)

The method's implementation still uses a (more complex!) regex, but it also handles edge cases such as invalid input, which yours does not.

line.scan(/\G\s*(?>([^\s\\\'\"]+)|'([^\']*)'|"((?:[^\"\\]|\\.)*)"|(\\.?)|(\S))(\s|\z)?/m)
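To make the difference on awkward or invalid input concrete, here is a small sketch. The helper name naive_split and the sample strings are made up for illustration; only Shellwords.split comes from the standard library.

```ruby
require 'shellwords'

# The question's regex wrapped in a helper (hypothetical name).
def naive_split(text)
  text.scan(/\"(.*?)\"|'(.*?)'|([^\s]+)/).flatten.compact
end

# A quoted part glued to an unquoted part is a single shell word.
glued = 'foo"bar baz"'
Shellwords.split(glued)  #=> ["foobar baz"]
naive_split(glued)       #=> ["foo\"bar", "baz\""]

# An unbalanced quote is invalid input: Shellwords raises ArgumentError,
# while the simple regex silently returns something.
unbalanced = "it's broken"
begin
  Shellwords.split(unbalanced)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
naive_split(unbalanced)  #=> ["it's", "broken"]
```

Whether raising on bad input is what you want depends on the application, but it is usually easier to handle an error than a silently wrong token list.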

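So, without delving too deeply into the flaws of your approach (suffice it to say, it does not always work!), why does it mostly work? Well, your regex: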

/\"(.*?)\"|'(.*?)'|([^\s]+)/

...says:

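If a " is found, match as little as possible (.*?) up to the closing ".

Same as above, for single quotes (').

If neither a single nor a double quote is found, scan ahead over the next run of non-whitespace characters ([^\s]+, which could equivalently be written as \S+).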
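These three rules also explain why embedded quotes of the other kind work: once a quoted alternative matches, scan consumes the whole quoted span and resumes after it, so a quote of the other kind inside that span is never tried as the start of a new token. A minimal sketch, with sample strings made up for illustration:

```ruby
pattern = /\"(.*?)\"|'(.*?)'|([^\s]+)/

# Double quotes inside single quotes are ordinary characters for (.*?).
mixed = "'a \"b\" c'"
mixed.scan(pattern).flatten.compact
  #=> ["a \"b\" c"]

# The same kind of quote inside ends the token early, because the lazy
# (.*?) stops at the first closing quote it can reach.
same = "'a 'b' c'"
same.scan(pattern).flatten.compact
  #=> ["a ", "b'", "c'"]
```

The second case is one of the inputs where the simple regex does not give a sensible result.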

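.flatten is necessary because you are using capture groups ((...)). You could avoid it by using non-capturing groups ((?:...)), although scan would then return whole matches, surrounding quotes included.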

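Because of these capture groups, .select{|x|x}, or the (effectively) equivalent .compact, is also necessary: in each match, 2 of the 3 groups did not take part and are therefore nil.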
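To see what .flatten and .compact are cleaning up, it can help to look at the raw result of scan. A short sketch on a made-up sample string:

```ruby
pattern = /\"(.*?)\"|'(.*?)'|([^\s]+)/
sample = "a 'b c' \"d\""

# With capture groups, scan returns one array per match, with one slot per
# group; the groups that did not take part in the match are nil.
raw = sample.scan(pattern)
  #=> [[nil, nil, "a"], [nil, "b c", nil], ["d", nil, nil]]

raw.flatten.compact
  #=> ["a", "b c", "d"]

# With non-capturing groups, scan returns the whole matches as a flat array.
flat = sample.scan(/\"(?:.*?)\"|'(?:.*?)'|(?:[^\s]+)/)
  #=> ["a", "'b c'", "\"d\""]
```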
