How do I parse a tab-delimited line that contains a quote character?

Time: 2017-01-13 21:59:32

Tags: ruby csv parsing tabs quotes

I'm using Ruby 2.4. How do I parse a tab-delimited line that contains a quote character? This is what currently happens:

2.4.0 :003 > line = "11\tDave\tO\"malley"
 => "11\tDave\tO\"malley" 
2.4.0 :004 > CSV.parse(line, col_sep: "\t")
CSV::MalformedCSVError: Illegal quoting in line 1.
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1912:in `block (2 levels) in shift'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `each'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1868:in `block in shift'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `loop'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1828:in `shift'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1770:in `each'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `to_a'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1784:in `read'
    from /Users/davea/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/csv.rb:1324:in `parse'
    from (irb):4
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in    `run_command!'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
    from bin/rails:4:in `require'
    from bin/rails:4:in `<main>'

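Although this example illustrates my point, I can't easily control the input I receive. So while the answer could arguably be "remove all quotes from the string before parsing," I would like to preserve as much of the data as possible.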

2 answers:

Answer 0: (score: 1)

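If you're trying to adhere to the CSV standard, then this is a malformed document. Instead, you could just brute-force it and hope the data itself contains no tabs: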

line.split(/\t/)
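
One caveat when preserving data this way: without a limit argument, split silently drops trailing empty fields. A minimal sketch, assuming empty trailing columns should be kept (split_fields is a hypothetical helper name, not part of the original answer):

```ruby
# Hypothetical helper: split a tab-delimited line with no quote handling.
# The -1 limit keeps trailing empty fields, which a plain split would drop.
def split_fields(line)
  line.chomp.split("\t", -1)
end

p split_fields("11\tDave\tO\"malley")  # the quote is kept as ordinary data
p split_fields("11\tDave\t")           # the trailing empty field is preserved
p "11\tDave\t".split(/\t/)             # a plain split drops it
```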

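A CSV parsing library comes in handy when you're dealing with data like this, where a quoted field itself contains the delimiter: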

"1\t2\t\"3a\t3b\"\t4"

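To make the difference concrete, here is a small comparison of the two approaches on that line. It is only a sketch, and the variable names are illustrative:

```ruby
require 'csv'

quoted_line = "1\t2\t\"3a\t3b\"\t4"

# split knows nothing about quoting, so the quoted field is cut in two
# and the quote characters stay in the data.
naive = quoted_line.split(/\t/)

# CSV treats the quoted section as a single field and strips the quotes.
parsed = CSV.parse_line(quoted_line, col_sep: "\t")

p naive
p parsed
```

The split result has five elements, while the CSV result has four, with the embedded tab kept inside the third field.
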
Update: if you're prepared to abuse the CSV library a little, you can do this:

CSV.parse("11\tDave\tO\"malley", col_sep: "\t", quote_char: "\0")

This essentially cripples quote detection, so if other data depends on quotes being handled properly, this may not work.
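
A quick sketch of that trade-off, assuming the input never contains a NUL byte (parse_without_quotes is a hypothetical helper name): the line from the question now parses, but a properly quoted field is no longer recognized as a single field.

```ruby
require 'csv'

# Hypothetical helper: parse tab-delimited text with quote handling
# effectively disabled by using NUL as the quote character.
def parse_without_quotes(text)
  CSV.parse(text, col_sep: "\t", quote_char: "\0")
end

# The stray quote is kept as ordinary data.
p parse_without_quotes("11\tDave\tO\"malley")

# A quoted field with an embedded tab is now split in two,
# and its quote characters stay in the output.
p parse_without_quotes("1\t2\t\"3a\t3b\"\t4")
```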

Answer 1: (score: 0)

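"11\tDave\tO\"malley" is not valid CSV data. Oddly enough, the answer is to double the double quote and to quote every element (strictly speaking, only the field that contains the quote needs to be quoted):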

2.3.1 :001 > require 'csv'
 => true 
2.3.1 :002 > line = "\"11\"\t\"Dave\"\t\"O\"\"malley\""
 => "\"11\"\t\"Dave\"\t\"O\"\"malley\"" 
2.3.1 :003 > puts line # for clarity
"11"    "Dave"  "O""malley"
 => nil 
2.3.1 :004 > CSV.parse(line, col_sep: "\t")
 => [["11", "Dave", "O\"malley"]]
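
If you control how the data is written, you don't have to build that escaping by hand. As a sketch, CSV.generate_line applies the same doubling rule, and its output round-trips through the parser (the variable names here are illustrative):

```ruby
require 'csv'

fields = ["11", "Dave", "O\"malley"]

# By default only fields that need it are quoted; the embedded quote is doubled.
minimal = CSV.generate_line(fields, col_sep: "\t")

# force_quotes: true quotes every field, matching the line in the transcript above.
quoted = CSV.generate_line(fields, col_sep: "\t", force_quotes: true)

puts minimal
puts quoted

# Both forms parse back to the original fields.
p CSV.parse_line(minimal, col_sep: "\t")
p CSV.parse_line(quoted, col_sep: "\t")
```

This only helps on the writing side; it does not repair input that already contains unescaped quotes.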