Ruby: cleaning a CSV with irregularly quoted fields

Date: 2016-05-17 00:22:00

Tags: ruby-on-rails ruby csv

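I have a CSV file with very irregular entries. The first entry of each row has no quotes of its own, the whole line is wrapped in quotes, and every other field is wrapped in doubled double quotes, like this: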

# my_file.csv, opened with sublime text :

# Headers
"first_name,""last_name"",""username"",""phone_number"",""address"",""email_address"",""email_address_confirmed"",""joined_at"",""status"",""is_admin"",""accept_emails_from_admin"",""language"",""can_post_listings"""

# Sample entry
"Mr X,""Mr X"",""mrxxx"","""","""",""mr@mrx.com"",""true"",""2015-09-21 09:08:51 UTC"",""accepted"",""true"",""true"",""fr"",""true"""

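I could preprocess the file with something other than Ruby (Excel, a simple regex search/replace, or anything else you can think of), but since I will probably have to do this several times, a Ruby solution would be great.

At the moment I am only using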

csv = File.open(csv_file_path)
CSV.parse(csv, :headers => true)

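and I really don't know how to work around this easily, given that only the first entry of each row is different...

The problem is that the CSV is not parsed correctly: each row is treated as a single string (instead of an array with as many items as there are columns).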

# csv.headers : note this is an array with a single string
["first_name,\"last_name\",\"username\",\"phone_number\",\"address\",\"email_address\",\"email_address_confirmed\",\"joined_at\",\"status\",\"is_admin\",\"accept_emails_from_admin\",\"language\",\"can_post_listings\""]

# csv.to_a.last
["xxx,\"xxxx\",\"martin\",\"\",\"\",\"xxx@xxxx.com\",\"false\",\"2016-05-12 13:06:53 UTC\",\"pending_email_confirmation\",\"false\",\"true\",\"fr\",\"false\""]
Edit: I tried the following

processed = File.readlines(path).map do |row|
    row.strip                 # strip newlines
      .gsub(/^\"|\"$/, '')   # remove outer quotes
      .gsub(/\"\"/, '"')     # fix double quotes
end
CSV.parse(processed.join('\n'))

I get CSV::MalformedCSVError: Missing or stray quote in line 1

Sample output

# File.readlines(path).first
# => "\"first_name,\"\"last_name\"\",\"\"username\"\",\"\"phone_number\"\",\"\"address\"\",\"\"email_address\"\",\"\"email_address_confirmed\"\",\"\"joined_at\"\",\"\"status\"\",\"\"is_admin\"\",\"\"accept_emails_from_admin\"\",\"\"language\"\",\"\"can_post_listings\"\"\"\n"

# processed.first
# => "first_name,\"last_name\",\"username\",\"phone_number\",\"address\",\"email_address\",\"email_address_confirmed\",\"joined_at\",\"status\",\"is_admin\",\"accept_emails_from_admin\",\"language\",\"can_post_listings\""

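Edit 2

Oops, sometimes I have embedded commas, and @Dave's answer does not seem to handle those cases. For example, this field

""45, street_addr - Place""

contains a comma that is not a separator. The full entry: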

"Mr x,""Mr xx"",""bbernelin"","""",""45, street_addr - Place"",""xxx@xxx.fr"",""true"",""2016-04-13 11:14:08 UTC"",""accepted"",""false"",""true"",""fr"",""true"""

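3 answers:

Answer 0: (score: 2)

From what I can tell, the whole line is wrapped in quotes, and then the individual fields are wrapped in doubled quotes. Fixing that keeps the CSV parser happy, so this seems to work: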

require 'csv'

processed = DATA.map do |row|
  row.strip                 # strip newlines
     .gsub(/^\"|\"$/, '')   # remove outer quotes
     .gsub(/\"\"/, '"')     # fix double quotes
end

# Join with a real newline: "\n" must be double-quoted, '\n' is a literal backslash + n
CSV.parse(processed.join("\n"), headers: true) do |row|
  p row
end

__END__
"first_name,""last_name"",""username"",""phone_number"",""address"",""email_address"",""email_address_confirmed"",""joined_at"",""status"",""is_admin"",""accept_emails_from_admin"",""language"",""can_post_listings"""
"Mr X,""Mr X"",""mrxxx"","""","""",""mr@mrx.com"",""true"",""2015-09-21 09:08:51 UTC"",""accepted"",""true"",""true"",""fr"",""true"""

Result:

#<CSV::Row "first_name":"Mr X" "last_name":"Mr X" "username":"mrxxx"
"phone_number":"" "address":"" "email_address":"mr@mrx.com" 
"email_address_confirmed":"true" "joined_at":"2015-09-21 09:08:51 UTC" 
"status":"accepted" "is_admin":"true" "accept_emails_from_admin":"true" 
"language":"fr" "can_post_listings":"true">
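
One detail worth checking when joining the cleaned rows: Ruby only turns \n into a newline inside double quotes. The following is a small sketch with made-up rows (not the file from the question) showing that joining with '\n' leaves everything on one line, which the CSV parser then rejects:

```ruby
require 'csv'

# Two already-cleaned rows, invented for this illustration.
rows = ['a,"b"', 'c,"d"']

single = rows.join('\n') # separator is a backslash followed by the letter n
double = rows.join("\n") # separator is a real newline character

puts single.lines.size # still a single line
puts double.lines.size # one line per row

# The newline-joined text parses into one array per row.
parsed = CSV.parse(double)

# The single-line text has stray characters after a closing quote.
malformed = begin
  CSV.parse(single)
  false
rescue CSV::MalformedCSVError
  true
end
```

This may well be the source of the MalformedCSVError in the question's edit, since that attempt joins with '\n' as well.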

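Answer 1: (score: 1)

It looks like there are

  • 0 or more quotes around each entry
  • exactly 1 comma between entries
  • no commas or quotes inside any entry

So you can replace all the quotes around each entry with exactly one pair of quotes:

csv = csv.gsub(/(?<=^|,)"*([^,"\n]*)"*(?=,|$)/, %Q("\\1"))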

The regex, with comments:

/
  (?<=^|,)    # pattern is preceded by the start of a line or a comma
  "*          # any number of "
  ([^,"\n]*)  # any number of characters other than , " or newline
  "*          # any number of "
  (?=,|$)     # pattern is followed by a comma or the end of a line
/x

It seems to produce the correct result on your example:

csv = %Q("first_name,""last_name"",""username"",""phone_number"",""address"",""email_address"",""email_address_confirmed"",""joined_at"",""status"",""is_admin"",""accept_emails_from_admin"",""language"",""can_post_listings"""\n) +
      %Q("Mr X,""Mr X"",""mrxxx"","""","""",""mr@mrx.com"",""true"",""2015-09-21 09:08:51 UTC"",""accepted"",""true"",""true"",""fr"",""true""")
CSV.parse(csv.gsub(/(?<=^|,)"*([^,"\n]*)"*(?=,|$)/, %Q("\\1")), headers: true).to_a
=> [
     ["first_name", "last_name", "username", "phone_number", "address", "email_address", "email_address_confirmed", "joined_at", "status", "is_admin", "accept_emails_from_admin", "language", "can_post_listings"],
     ["Mr X", "Mr X", "mrxxx", "", "", "mr@mrx.com", "true", "2015-09-21 09:08:51 UTC", "accepted", "true", "true", "fr", "true"]
   ]
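
The third assumption in the list above (no commas inside an entry) matters in practice. As a rough check, here is the same regex applied to the full entry from the question's second edit, whose address contains a comma; the variable names are only for illustration:

```ruby
require 'csv'

# The entry with an embedded comma, copied from the question.
line = '"Mr x,""Mr xx"",""bbernelin"","""",""45, street_addr - Place"",""xxx@xxx.fr"",""true"",""2016-04-13 11:14:08 UTC"",""accepted"",""false"",""true"",""fr"",""true"""'

# Same substitution as above: normalise the quotes around each comma-separated piece.
cleaned = line.gsub(/(?<=^|,)"*([^,"\n]*)"*(?=,|$)/, %Q("\\1"))

fields = CSV.parse_line(cleaned)

# The address is split at its comma, so there is one field too many
# and every later column is shifted by one.
p fields.size
p fields[4, 2]
```

So this approach is only a fit for files where no value can contain a comma.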

Answer 2: (score: 1)

Well, I ended up with:

processed = File.readlines(path).map do |row|
    row.strip.gsub('""', '"')[1..-2]
end.join("\n")
CSV.parse(processed)

The [1..-2] simply removes the extra " at the start and end of each line.
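
A possible alternative that avoids string substitution altogether: each line of the file is itself valid CSV holding a single quoted field, and the content of that field is the real CSV row. Parsing twice should therefore recover the columns, including values with embedded commas. The sketch below is untested against the real file; parse_wrapped_csv is a hypothetical helper name and the sample is a shortened version of the data in the question:

```ruby
require 'csv'

# Hypothetical helper: the first pass unwraps each line into one string,
# the second pass parses that string as an ordinary CSV row.
def parse_wrapped_csv(text)
  rows = CSV.parse(text, skip_blanks: true).map do |outer_fields|
    CSV.parse_line(outer_fields.first)
  end
  headers, *records = rows
  # Pair every record with the header row to get one hash per entry.
  records.map { |fields| headers.zip(fields).to_h }
end

# Shortened sample with the same quoting pattern as the file in the question.
sample = <<~CSV_TEXT
  "first_name,""last_name"",""address"""
  "Mr x,""Mr xx"",""45, street_addr - Place"""
CSV_TEXT

result = parse_wrapped_csv(sample)
p result
```

For a real file, something like parse_wrapped_csv(File.read(path)) would be the entry point, assuming the whole file follows the same pattern.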