How do I split a CSV string in Ruby?

Asked: 2010-10-14 12:28:06

Tags: ruby regex csv split

I have this line from a CSV file as an example:

2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"

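I want to split it into an array. The obvious approach is to split on commas, but some of the strings contain commas, e.g. "Life and Living Processes, Life Processes", and these should stay as single elements in the array. Note also that there are two commas with nothing between them; I want those to come out as empty strings.

In other words, the array I want is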

[2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes","","",1,0,"endofline"]

I can think of hacky ways involving eval, but I'm hoping someone can come up with a clean regex to do it...

Cheers, max

6 answers:

Answer 0 (score: 9)

This is not a suitable task for a regular expression. You need a CSV parser, and Ruby has one built in:

http://ruby-doc.org/stdlib/libdoc/csv/rdoc/classes/CSV.html

and there is an arguably superior third-party library:

http://fastercsv.rubyforge.org/
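To make the difference concrete, here is a minimal sketch that compares a naive split on commas with the standard library's CSV.parse_line, using the line from the question. The variable names are only illustrative.

```ruby
require 'csv'

line = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

# Splitting on every comma cuts the quoted field in two and leaves the quotes in place.
naive = line.split(",")

# CSV.parse_line understands quoting; empty unquoted fields come back as nil.
parsed = CSV.parse_line(line)

puts naive.size   # 11 pieces
puts parsed.size  # 10 fields
p parsed[4]       # "Life and Living Processes, Life Processes"
```

Note that the parser returns every non-empty field as a String, so 2412 arrives as "2412" unless a converter is supplied.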

Answer 1 (score: 3)

str=<<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF
require 'csv' # built in

p CSV.parse(str)
# That's it! However, empty fields appear as nil.
# Makes sense to me, but if you insist on empty strings then do something like:
parser = CSV.new(str)
parser.convert{|field| field.nil? ? "" : field}
p parser.readlines
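If the goal is exactly the array from the question (integers for the numeric fields, empty strings for the empty ones), one possible sketch is to parse the single line with the built-in :integer converter and replace nil after parsing. Doing the replacement with map rather than inside a converter is a choice made here so that the result does not depend on how a particular version of the csv library passes nil fields to converters.

```ruby
require 'csv'

str = <<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF

# parse_line reads one row; converters: :integer turns fields like "2412" into 2412.
row = CSV.parse_line(str, converters: :integer)

# Empty unquoted fields are nil at this point; swap them for empty strings.
row = row.map { |field| field.nil? ? "" : field }

p row
```

Unlike CSV.parse, which returns an array of rows, parse_line returns a single flat array, which matches the shape asked for in the question.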

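Answer 2 (score: 2)

Edit: I didn't read the Ruby tag. The good news is that the guide explains the theory behind building this, even if the language details don't apply. Sorry.

Here is a great guide: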

http://knab.ws/blog/index.php?/archives/10-CSV-file-parser-and-writer-in-C-Part-2.html

and the CSV writer is here:

http://knab.ws/blog/index.php?/archives/3-CSV-file-parser-and-writer-in-C-Part-1.html

These examples cover the case of quoted text in the CSV (which may or may not contain commas).
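For readers who want the idea in Ruby, the following is a rough sketch of a character-by-character parser for a single line. It is not taken from the linked articles; the method name split_csv_line is hypothetical, and it assumes comma separators, double-quote quoting, and a doubled quote ("") as the escape for a literal quote.

```ruby
# Walk the line one character at a time, tracking whether we are inside quotes.
def split_csv_line(line)
  fields = []
  current = String.new
  in_quotes = false
  chars = line.chomp.chars
  i = 0
  while i < chars.length
    c = chars[i]
    if in_quotes
      if c == '"' && chars[i + 1] == '"'
        current << '"'        # doubled quote inside quotes: literal quote
        i += 1                # skip the second quote of the pair
      elsif c == '"'
        in_quotes = false     # closing quote
      else
        current << c          # commas inside quotes are ordinary text
      end
    elsif c == '"'
      in_quotes = true        # opening quote
    elsif c == ","
      fields << current       # separator outside quotes ends the field
      current = String.new
    else
      current << c
    end
    i += 1
  end
  fields << current           # the last field has no trailing comma
  fields
end

p split_csv_line('a,"b ""c"" d",')
```

Every field comes back as a String, including the empty ones, so numeric fields would still need converting.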

Answer 3 (score: 2)

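text = <<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF
x = []
segments = text.chomp.split('"')
segments.each_with_index do |segment, i|
  if i.odd?
    x << segment                         # text inside quotes: keep as one field
  else
    fields = segment.split(",", -1)      # limit -1 keeps trailing empty fields
    fields.shift if i > 0                # drop the empty piece left by the comma after a closing quote
    fields.pop if i < segments.size - 1  # drop the empty piece left by the comma before an opening quote
    x.concat(fields)
  end
end
p x  # all fields are strings; escaped quotes ("") are not handled

Output

$ ruby test.rb
["2412", "21", "Which of the following is not found in all cells?", "Curriculum", "Life and Living Processes, Life Processes", "", "", "1", "0", "endofline"]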

Answer 4 (score: 1)

This morning I stumbled upon a CSV table importer project for Ruby on Rails. You may find the code useful:

Github TableImporter

Answer 5 (score: 0)

My preference would be @steenstag's solution, but an alternative is to use String#scan with the following regular expression.

r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/

If the variable str holds the string given in the example, we obtain:

puts str.scan r

which displays

2412
21
"Which of the following is not found in all cells?"
"Curriculum"
"Life and Living Processes, Life Processes"


1
0
"endofline"

Start your engine!

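See also regex101, which provides a detailed explanation of each token of the regular expression. (Move the cursor across the regex.)

Ruby's regex engine performs the following operations.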

(?<![^,]) : negative lookbehind asserts current location is not preceded
            by a character other than a comma
(?:       : begin non-capture group
  (?!")   : negative lookahead asserts next char is not a double-quote
  [^,\n]* : match 0+ chars other than a comma and newline
  (?<!")  : negative lookbehind asserts preceding character is not a
            double-quote
  |       : or
  "       : match double-quote
  [^"\n]* : match 0+ chars other than double-quote and newline
  "       : match double-quote
)         : end of non-capture group
(?![^,])  : negative lookahead asserts current location is not followed
            by a character other than a comma

Note that, for a string containing no newline, (?<![^,]) is the same as (?<=,|^) and (?![^,]) is the same as (?=,|$).
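As a usage sketch, the regex can be combined with a small clean-up step to get an array without the surrounding quotes. Two details here are assumptions added for illustration: the string is built with a heredoc (so it ends in a newline and is chomped first, because the trailing lookahead rejects a newline after the last field), and the quotes are stripped with two sub calls.

```ruby
r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/

str = <<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF

# chomp first: (?![^,]) fails if the last field is followed by "\n".
raw_fields = str.chomp.scan(r)

# scan keeps the surrounding double quotes; strip one from each end where present.
fields = raw_fields.map { |f| f.sub(/\A"/, "").sub(/"\z/, "") }

p fields
```

All ten fields come back as strings, with the two empty fields as "". The regex does not handle quoted fields that contain escaped quotes or newlines, which is one reason a CSV parser remains the safer choice.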