Question

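I'm trying to parse (in Ruby) what is effectively the UNIX passwd file format: comma delimiters, with an escape character \ such that anything escaped should be taken literally. I'm trying to use a regular expression for this, but I'm coming up short, even when using Oniguruma for lookahead/lookbehind assertions.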

Essentially, all of the following should work:

a,b,c    # => ["a", "b", "c"]
\a,b\,c  # => ["a", "b,c"]
a,b,c\
d        # => ["a", "b", "c\nd"]
a,b\\\,c # => ["a", "b\,c"]

Any ideas?

The first response looks pretty good. With a file containing:

\a,,b\\\,c\,d,e\\f,\\,\
g

it gives:

[["\\a,"], [","], ["b\\\\\\,c\\,d,"], ["e\\\\f,"], ["\\\\,"], ["\\\ng\n"], [""]]

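Pretty close. I don't need the unescaping done in the first pass, as long as everything splits correctly on the commas. I tried Oniguruma and ended up with the (much longer):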

Oniguruma::ORegexp.new(%{
  (?:       # - begins with (but doesn't capture)
    (?<=\A) #   - start of line
    |       #   - (or) 
    (?<=,)  #   - a comma
  )

  (?:           # - contains (but doesn't capture)
    .*?         #   - any set of characters
    [^\\\\]?    #   - not ending in a slash
    (\\\\\\\\)* #   - followed by an even number of slashes
  )*?

  (?:      # - ends with (but doesn't capture)
    (?=\Z) #   - end of line
    |      #   - (or)
    (?=,)) #   - a comma
  },

  'mx'
).scan(s)

Answer 1

Try this:

s.scan(/((?:\\.|[^,])*,?)/m)

It doesn't translate the characters following a \, but that can be done afterwards as a separate step.
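A minimal sketch of that separate step, for what it's worth. The helper names `split_fields`, `unescape` and `parse` are made up for illustration, and the split regex is a variation on the one above rather than the same expression: it appends a delimiter so every field is terminated by a comma, which sidesteps the empty match that `scan` otherwise returns at the end of the string. It assumes the whole input is one record and does not end in a dangling backslash.

```ruby
# Hypothetical helpers, for illustration only.

# Split on unescaped commas. Appending a delimiter means every field,
# including the last one, is followed by a comma, so each match consumes
# at least one character and scan yields no spurious empty match.
def split_fields(s)
  (s + ",").scan(/((?:\\.|[^,\\])*),/m).flatten
end

# Replace each backslash-escaped pair with the escaped character itself.
# The block form avoids backslash handling in a replacement string.
def unescape(field)
  field.gsub(/\\(.)/m) { $1 }
end

# Split first, then unescape each field as a second pass.
def parse(s)
  split_fields(s).map { |f| unescape(f) }
end

p parse('\a,b\,c')  # => ["a", "b,c"]
```

Keeping the two passes separate means the split only has to decide where fields end, and the unescaping rule can change without touching it.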

Answer 2

I'd give the CSV class a try.

A regex solution (hack?) might look like this:

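#!/usr/bin/ruby -w

# contents of test.csv:
#   a,b,c
#   \a,b\,c
#   a,b,c\
#   d
#   a,b\\\,c

# A token is either a non-empty field (escaped pairs or ordinary
# characters) or a line break. With `*` instead of `+`, the first
# alternative would match the empty string at every comma and line break,
# so the line-break branch would never be reached. Note that `+` skips
# empty fields.
tokens = File.read("test.csv").scan(/(?:\\.|[^,\r\n])+|\r?\n/m)
puts "-----------"
tokens.each do |token|
  if token == "\n" || token == "\r\n"
    puts "-----------"
  else
    puts ">" + token + "<"
  end
end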

which would produce the output:

-----------
>a<
>b<
>c<
-----------
>\a<
>b\,c<
-----------
>a<
>b<
>c\
d<
-----------
>a<
>b\\\,c<
-----------
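
The tokenizer above yields a flat list of fields and line breaks, still escaped. If rows of unescaped fields (empty ones included) are wanted, one alternative is a small hand-written scanner rather than a regex. This is only a sketch under a few assumptions: `parse_rows` is a hypothetical name, lines end in "\n" (not "\r\n"), and a dangling backslash at the very end of the input is silently dropped.

```ruby
# Hypothetical helper: parse text into rows of unescaped fields.
def parse_rows(text)
  rows = []
  row = []
  field = "".dup
  escaped = false

  text.each_char do |ch|
    if escaped
      # The previous character was a backslash: take this one literally.
      field << ch
      escaped = false
    elsif ch == "\\"
      escaped = true
    elsif ch == ","
      # Unescaped comma ends the current field.
      row << field
      field = "".dup
    elsif ch == "\n"
      # Unescaped newline ends the current field and the current row.
      row << field
      rows << row
      row = []
      field = "".dup
    else
      field << ch
    end
  end

  # Flush a final row that is not terminated by a newline.
  unless field.empty? && row.empty?
    row << field
    rows << row
  end
  rows
end

p parse_rows("a,b\\,c\nd,,e\n")  # => [["a", "b,c"], ["d", "", "e"]]
```

Because splitting and unescaping happen in the same pass, there is no need for lookbehind assertions or for counting backslashes.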

Parsing delimited text with escape characters
