I have a CSV dump from SQL Server 2008 that contains rows like the following:
Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
parse_dbenhur is pretty, but can it be rewritten to support fields that contain both commas and quotes? parse_ugly is ugly.
# @dbenhur's excellent answer, which works 100% for what I originally asked for
SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
def parse_dbenhur(line)
  line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end
def parse_ugly(line)
  dumb_fields = line.chomp.split(',').map { |v| v.gsub(/\s+/, ' ') }
  fields = []
  open = false
  dumb_fields.each_with_index do |v, i|
    open ? fields.last.concat(v) : fields.push(v)
    open = (v.start_with?('"') and (v.count('"') % 2 == 1) and dumb_fields[i+1] and dumb_fields[i+1].start_with?(' ')) || (open and !v.end_with?('"'))
  end
  fields.map { |v| (v.start_with?('"') and v.end_with?('"')) ? v[1..-2] : v }
end
lines = []
lines << 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
lines << 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00'
lines << 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'
require 'csv'
lines.each do |line|
  puts
  puts line
  begin
    c = CSV.parse_line(line)
    puts "#{c.to_csv.chomp} (size #{c.length})"
  rescue
    puts "FasterCSV says: #{$!}"
  end
  a = parse_ugly(line)
  puts "#{a.to_csv.chomp} (size #{a.length})"
  b = parse_dbenhur(line)
  puts "#{b.to_csv.chomp} (size #{b.length})"
end
Here is the output when I run it:
Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
FasterCSV says: Illegal quoting in line 1.
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
FasterCSV says: Unclosed quoted field on line 1.
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS""",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS""""",1997-05-15 00:00:00 (size 5)
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS"" FOOBAR",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS"" FOOBAR""",1997-05-15 00:00:00 (size 5)
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00 (size 4)
Construction,198120036B,"""""MERITER""","""DO IT CTR"""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
Construction,198120036B,"""""""MERITER""""","""""DO IT CTR"""""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
Update
Note that the CSV uses double quotes when a field contains a comma.
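For comparison, here is a small sketch of what Ruby's standard CSV library writes for fields that contain commas or quotes. The field values are made up for illustration. A conforming writer wraps such a field in double quotes and doubles any embedded quote, which is the step most of the sample rows above are missing.

```ruby
require 'csv'

# Hypothetical field values, chosen only to show the quoting rules.
row = ['Electrical', 'SERVICE, OUTLETS', 'W/1" COPPER']

# CSV.generate_line quotes fields that contain the separator or a quote
# and escapes an embedded quote by doubling it.
well_formed = CSV.generate_line(row)
puts well_formed

# Well-formed output parses back into the original fields.
round_trip = CSV.parse_line(well_formed)
puts round_trip.inspect
```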
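Update 2
It would be fine if the commas were stripped from the affected fields... my parse_ugly method doesn't preserve them.
Update 3
I learned from the client that SQL Server 2008 is exporting this strange CSV - it has been reported to Microsoft here and here
Update 4
HOPEFULLY FINAL UPDATE
This code works (and I think it is semantically correct):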
QUOTED = /"((?:[^"]|(?:""(?!")))*)"/
SEPQ = /,(?! )/
UNQUOTED = /([^,]*)/
SEPU = /,(?=(?:[^ ]|(?: +[^",]*,)))/
FIELD = /(?:#{QUOTED}#{SEPQ})|(?:#{UNQUOTED}#{SEPU})|\Z/
def parse_sql_server_2008_csv_line(line)
  line.scan(FIELD)[0...-1].map{ |matches| (matches[0] || matches[1]).tr(',', ' ').gsub(/\s+/, ' ') }
end
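The SEPQ pattern above treats a comma as a separator only when it is not followed by a space. The sketch below isolates that single idea with String#split, which is a simplification of mine and not the full FIELD regex. The constant name SEPARATOR is hypothetical. It assumes every comma embedded in a field is followed by a space, which does not hold for every row in the dump.

```ruby
# Sketch only: split on commas that are NOT followed by a space.
# Assumption: a comma inside a field is always followed by a space.
SEPARATOR = /,(?! )/

line = 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
fields = line.split(SEPARATOR)
puts fields.inspect

# Counterexample: an embedded comma directly followed by a quote is still
# treated as a separator, so this row yields one field too many.
tricky = 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'
tricky_fields = tricky.split(SEPARATOR)
puts tricky_fields.size
```

The surrounding quotes are left in place here. Stripping them, and handling rows like the counterexample, is what the extra lookaheads in the longer regex are aiming at.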
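Adapted from the answers by @dbenhur and @ghostdog74 to How can I process a CSV file with “bad commas”?
Answer 0 (score: 1)
If your CSV doesn't use double quotes as a legitimate quote character, adjust the options to CSV by passing :quote_char => "\0", and then you can do this (strings wrapped for clarity)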
1.9.3p327 > puts 'Construction,197133031B,"MORGAN SHOES" ALT,
1997-05-13 00:00:00'.parse_csv(:quote_char => "\0")
Construction
197133031B
"MORGAN SHOES" ALT
1997-05-13 00:00:00
1.9.3p327 > puts 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,
1996-08-09 00:00:00'.parse_csv(:quote_char => "\0")
Plumbing
196222006P
REPLACE LEAD WATER SERVICE W/1" COPPER
1996-08-09 00:00:00
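The same idea as a standalone script, as a rough sketch: it uses CSV.parse_line with the quote_char option instead of String#parse_csv, and adds one sample row with an embedded comma to show the trade-off. Once double quotes are treated as ordinary data, a comma inside a "quoted" field splits that field too.

```ruby
require 'csv'

line = 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'

# With the quote character set to NUL, double quotes are ordinary data.
fields = CSV.parse_line(line, quote_char: "\0")
puts fields.inspect

# Caveat: a comma inside a "quoted" field is now a separator as well.
with_comma = 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
split_fields = CSV.parse_line(with_comma, quote_char: "\0")
puts split_fields.size
```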
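Answer 1 (score: 1)
The following uses a regexp and String#scan. I observe that in the CSV format you're dealing with, " only has quoting properties when it appears at the beginning and end of a field.
Scan moves through the string matching the regexp successively, so the regexp can assume that its starting match point is the beginning of a field. We construct the regexp so that it matches either a balanced quoted field with no internal quotes (QUOTED) or a run of non-commas (UNQUOTED). Whichever alternative matches, it must be followed by a separator, which is either a comma or the end of the string (SEP).
Because UNQUOTED can match a zero-length field before a separator, the scan always matches an empty field at the end, which we discard with [0...-1]. Scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element and pick the alternative that was captured with matches[0] || matches[1].
None of your example lines shows a field that contains both a comma and a quote. I don't know how that would be legally represented, and this code probably won't recognize such a field correctly.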
SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
def ugly_parse line
  line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end
lines.each do |l|
  puts l
  puts ugly_parse(l).inspect
  puts
end
# Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
# ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
#
# Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
# ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
#
# Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
# ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]
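To make the tuple structure described above concrete, here is a minimal sketch that prints the raw result of scan before the trailing empty match is dropped. The input line is made up for illustration; the regexps are the ones from this answer.

```ruby
SEP      = /(?:,|\Z)/
QUOTED   = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD    = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/

line = 'a,"b, c",d' # hypothetical input

# Each tuple is [quoted_capture, unquoted_capture]; the alternative that
# did not match is nil. The last tuple is the zero-length match at the end.
tuples = line.scan(FIELD)
puts tuples.inspect
# → [[nil, "a"], ["b, c", nil], [nil, "d"], [nil, ""]]

# Drop the trailing empty match and keep whichever capture is present.
fields = tuples[0...-1].map { |quoted, unquoted| quoted || unquoted }
puts fields.inspect
# → ["a", "b, c", "d"]
```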