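I have a CSV file containing details for several thousand customers. I want to extract the duplicate customers, meaning rows that share the same values in a chosen set of columns.
For example, I want to extract every customer that has more than one record with the same "surname" and "postcode".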
"surname","postcode","other-stuff-that-doesn't-matter"...
"smith", "AB1 2CD", "dxfh"...
"smith", "AB1 2CD", "98sf"...
"jones", "BC2 3DE", "as0j"...
"jones", "BC2 3DE", "9as6"...
"blogs", "BC2 3DE", "9as6"...
Given the above, the program would return a new CSV like this:
"surname","postcode","other-stuff-that-doesn't-matter"...
"smith", "AB1 2CD", "dxfh"...
"smith", "AB1 2CD", "98sf"...
"jones", "BC2 3DE", "as0j"...
"jones", "BC2 3DE", "9as6"...
Thanks for any help. I think I have a working solution, but I'd love to know whether it can be optimized (I'm sure it can!).
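require 'csv'
require 'set'

set_one = Set.new   # every [surname, post_code] pair seen so far
set_two = Set.new   # pairs seen more than once
duplicates = Array.new
headers = nil

# First pass: work out which pairs occur more than once.
# NOTE: the keys must match the converted header names. :post_code assumes a
# header such as "post_code"; a "postcode" header would become :postcode.
CSV.foreach('customers.csv', :headers => true, :header_converters => :symbol) do |row|
  headers = row.headers unless headers
  values = [row[:surname], row[:post_code]]
  if set_one.include? values
    set_two << values
  else
    set_one << values
  end
end

# Second pass: collect every row whose pair is a known duplicate.
CSV.foreach('customers.csv', :headers => true, :header_converters => :symbol) do |row|
  values = [row[:surname], row[:post_code]]
  if set_two.include? values
    duplicates << row
  end
end

CSV.open("duplicate-customers.csv", "wb") do |csv|
  csv << headers
  duplicates.each { |dupe| csv << dupe }
end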
Answer 0 (score: 3)
Let's start by reading the CSV (no handling of escapes or quotes; this is only an example):
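csv = []
columns = []
# File.foreach yields the file one line at a time (File.read does not take a block).
File.foreach('csv.file') do |line|
  fields = line.chomp.split(',')
  if columns.empty?
    # The first line holds the column names.
    columns = fields
  else
    row_data = {}
    fields.each_with_index do |c, i|
      row_data[columns[i]] = c
    end
    csv << row_data
  end
end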
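As a hedged aside, Ruby's standard csv library can build the same array of hashes while also coping with quoted fields and embedded commas. The following is only a minimal sketch; the helper name rows_from_csv and the sample data are hypothetical, introduced here for illustration.

```ruby
require 'csv'

# Hypothetical sample data: quoted fields may contain commas or doubled quotes.
SAMPLE = <<~DATA
  surname,postcode,otherstuff
  smith,AB1 2CD,"likes tea, biscuits"
  "o""brien",BC2 3DE,xyz
DATA

# Hypothetical helper: parse CSV text into an array of plain hashes keyed by
# header name, the same shape the hand-rolled reader above produces.
def rows_from_csv(text)
  CSV.parse(text, headers: true).map(&:to_h)
end

rows = rows_from_csv(SAMPLE)
puts rows.first['otherstuff']
```

For a file on disk, CSV.foreach with headers: true yields the same kind of row objects one at a time.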
OK, so what do we do with the data? It looks like this:
[{'surname' => 'smith', 'postcode' => '1234', 'otherstuff' => 'xyz' },
{'surname' => 'jones', 'postcode' => '1234', 'otherstuff' => 'xyz' },
{'surname' => 'smith', 'postcode' => '2345', 'otherstuff' => 'xyz' },
{'surname' => 'smith', 'postcode' => '1234', 'otherstuff' => 'xyz' }]
We could filter it like so:
csv.select do |c|
  csv.any? do |s|
    # Skip the row itself, otherwise every row would match.
    !s.equal?(c) && s['surname'].eql?(c['surname']) && s['postcode'].eql?(c['postcode'])
  end
end
OK, that is slow and not very clever. Let's move on to solution 2: build a hash key from the data we want to check for uniqueness:
sneakyhash = {}
csv.each do |row|
  magic_string = [row['surname'], row['postcode']].join("--MaGiCaL--SpLiTTinG--StRiNG--")
  if sneakyhash[magic_string].nil?
    sneakyhash[magic_string] = 1
  else
    # row is a Hash, so join its values rather than the hash itself.
    puts "this guy looks suspicious: " + row.values.join(',')
  end
end
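Note that the loop above only flags the second and later occurrences of a key, not the first. One possible variation, sketched here under the assumption that the rows are hashes as above, groups the rows by an array key (arrays work as hash keys in Ruby, so no joining string is needed) and keeps every group with more than one member. The helper name duplicate_rows and the sample data are hypothetical.

```ruby
# Rows in the same array-of-hashes shape used above (hypothetical sample data).
rows = [
  { 'surname' => 'smith', 'postcode' => '1234', 'otherstuff' => 'a' },
  { 'surname' => 'jones', 'postcode' => '1234', 'otherstuff' => 'b' },
  { 'surname' => 'smith', 'postcode' => '2345', 'otherstuff' => 'c' },
  { 'surname' => 'smith', 'postcode' => '1234', 'otherstuff' => 'd' }
]

# Hypothetical helper: return every row whose values for `keys` occur in
# more than one row, including the first occurrence.
def duplicate_rows(rows, keys)
  groups = rows.group_by { |row| row.values_at(*keys) }
  groups.values.select { |group| group.size > 1 }.flatten(1)
end

dupes = duplicate_rows(rows, %w[surname postcode])
dupes.each { |row| puts row.values.join(',') }
```

This walks the data once to build the groups, instead of comparing every row against every other row.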
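Far from optimal, but I'm just thinking out loud here. If it's a one-off job and you only need to parse a single file, go with whatever you can come up with.
What you probably want to do is store this identifying string in an array or hash while reading the CSV, check whether the current row matches any of the stored unique rows, and act on it if so.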
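A rough sketch of that idea, using Ruby's standard csv library and a single pass over the file, might look like the following. It is an illustration under assumptions rather than a definitive implementation: the helper name write_duplicates and its parameters are hypothetical, and the header names are assumed to be plain strings such as "surname" and "postcode".

```ruby
require 'csv'

# Hypothetical helper: stream input_path once and write every row whose
# `keys` columns repeat to output_path. Returns the number of rows written.
def write_duplicates(input_path, output_path, keys)
  first_seen = {}  # key => first row with that key, or :written once emitted
  written = 0
  CSV.open(output_path, 'w') do |out|
    header_done = false
    CSV.foreach(input_path, headers: true) do |row|
      unless header_done
        out << row.headers
        header_done = true
      end
      key = keys.map { |k| row[k] }
      if !first_seen.key?(key)
        # Hold the first occurrence back until a second one shows up.
        first_seen[key] = row
      else
        unless :written == first_seen[key]
          out << first_seen[key]
          first_seen[key] = :written
          written += 1
        end
        out << row
        written += 1
      end
    end
  end
  written
end
```

A call could look like write_duplicates('customers.csv', 'duplicate-customers.csv', %w[surname postcode]). The sketch keeps one row per distinct key in memory, and a held-back first occurrence is written only when its second occurrence is found, so the output order can differ from the input order when duplicates are not adjacent.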