Question

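The incoming data files contain malformed CSV data (such as unescaped quotes) as well as valid CSV data (such as fields containing newlines). If a CSV formatting error is detected, I would like to use an alternative routine on that data.

With the following sample code (abbreviated for simplicity):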

FasterCSV.open( file ){|csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the good rows here...

    rescue FasterCSV::MalformedCSVError => e
      # Do things with the bad rows here...
      next
    end
  end
}

The MalformedCSVError is raised in the csv.shift method. How can I access the data that caused the error from the rescue clause?

Answer 1

require 'csv' #CSV in ruby 1.9.2 is identical to FasterCSV

# File.open('test.txt','r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row #handle row
    end
  rescue  CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end

# Output:

# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]   

__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid

Just feed the file line by line to FasterCSV and rescue the error.
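As a rough sketch of how this line-by-line approach could hand bad lines to a separate routine: the helper name partition_csv_lines is made up for illustration, and it assumes the standard csv library (under FasterCSV the error class would be FasterCSV::MalformedCSVError). Like the code above, it treats every physical line as one record, so a valid field that contains a newline ends up among the bad lines.

```ruby
require 'csv'

# Hypothetical helper: parse each physical line on its own and separate
# the rows that parse from the raw lines that do not.
def partition_csv_lines(io)
  good = []
  bad  = []
  io.each_line do |line|
    begin
      row = CSV.parse_line(line)
      good << row if row               # skip the case where nothing was parsed
    rescue CSV::MalformedCSVError => e
      bad << [line.chomp, e.message]   # keep the raw text for the fallback routine
    end
  end
  [good, bad]
end
```

It could be called as good, bad = File.open('test.txt') { |f| partition_csv_lines(f) }, after which each entry of bad holds the raw line and the parser's message.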

Answer 2

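This is going to be really difficult. Some of the things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is create your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then, when FasterCSV raises an exception, you can ask your special File object for the last line. Something like this: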

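class MyFile < File
  attr_accessor :last_gets

  def gets(*args)
    line = super
    # Buffer every line read since the last reset. The buffer is created
    # lazily per instance, and nil (end of file) is never appended.
    # Each line already ends with its separator, so none is added here.
    (@last_gets ||= '') << line if line
    line
  end
end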

# then...

file  = MyFile.open(filename, 'r')
csv   = FasterCSV.new file

row = true
while row
  begin
    break unless row = csv.shift

    # do things with the good row here...

  rescue FasterCSV::MalformedCSVError => e
    bad_row = file.last_gets

    # do something with bad_row here...

    next
  ensure
    file.last_gets = '' # nuke the @last_gets "buffer"
  end
end

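Kind of neat, right? BUT! There are caveats, of course:

I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.
If your CSV file contains newlines inside quoted fields, this may or may not fail. The reason is gets: basically, if a quoted value contains a newline, then shift has to make additional gets calls to get the entire line. There may be a clever way around this limitation, but it isn't coming to me right now. If you're sure your file has no newline characters inside quoted fields, then this shouldn't be a worry for you.

Your other option would be to read the file using File.gets and pass each line in turn to the FasterCSV parsing routine described in the source, but I'm pretty sure that in doing so you'd squander any performance advantage gained from using FasterCSV.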
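One possible way around the quoted-newline limitation, when reading the file line by line yourself, is to glue physical lines together until the double quotes balance and only then parse. This is only a heuristic sketch, not something FasterCSV provides: the helper name each_csv_record is invented here, it assumes the standard csv library and the default quote character, and a line with an odd number of stray quotes will swallow the lines that follow it.

```ruby
require 'csv'

# Hypothetical helper: accumulate physical lines until the number of
# double quotes is even, then parse that candidate record on its own.
# Yields (row, nil) for a good record and (nil, raw_text) for a bad one.
def each_csv_record(io)
  buffer = +''
  io.each_line do |line|
    buffer << line
    next if buffer.count('"').odd?   # probably still inside a quoted field
    begin
      row = CSV.parse_line(buffer)
    rescue CSV::MalformedCSVError
      row = nil
    end
    if row
      yield row, nil
    else
      yield nil, buffer
    end
    buffer = +''                     # start a fresh record
  end
  yield nil, buffer unless buffer.empty?  # quotes never balanced before EOF
end
```

The block receives either a parsed row or the raw text of the record that failed, so the alternative routine can work on exactly the lines that caused the error.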

Answer 3

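I used Jordan's file subclassing approach to fix a problem with my input data before CSV ever tried to parse it. In my case, I had a file that used \" to escape quotes instead of the "" that CSV expects. Hence,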

class MyFile < File
  def gets(*args)
    line = super
    if line != nil
      line.gsub!('\\"','""')  # fix the \" that would otherwise cause a parse error
    end
    line
  end
end

infile = MyFile.open(filename)
incsv = CSV.new(infile)

while row = incsv.shift  # read rows from the CSV wrapper, not from the raw file
  # process each row here
end

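This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.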
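For data that is already in memory, the same repair can be sketched without subclassing File. The helper name parse_backslash_escaped_csv is hypothetical, and the sketch assumes that every \" in the input is meant as an escaped quote; a quoted field that legitimately ends in a backslash would be mangled by it.

```ruby
require 'csv'

# Hypothetical helper: rewrite backslash-escaped quotes (\") into the
# doubled quotes ("") that CSV expects, then parse the repaired text.
def parse_backslash_escaped_csv(text)
  CSV.parse(text.gsub('\\"', '""'))
end
```

Because the whole text is repaired before parsing, quoted fields that span several lines are still handled by CSV itself.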

How can I further process the line of data that causes the Ruby FasterCSV library to throw a MalformedCSVError?
