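Extract duplicate records from a CSV based on selected headers

Date: 2011-08-23 11:28:52

Tags: php ruby arrays algorithm csv

I have a CSV file containing details for thousands of customers. I want to extract the duplicate customers, where "duplicate" means having identical values under a chosen set of headers.

For example, I want to extract every customer whose "surname" and "postcode" both appear in more than one record.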

"surname","postcode","other-stuff-that-doesn't-matter"...
"smith",  "AB1 2CD", "dxfh"...
"smith",  "AB1 2CD", "98sf"...
"jones",  "BC2 3DE", "as0j"...
"jones",  "BC2 3DE", "9as6"...
"blogs",  "BC2 3DE", "9as6"...

Given the input above, the program would return a new CSV like this:

"surname","postcode","other-stuff-that-doesn't-matter"...
"smith",  "AB1 2CD", "dxfh"...
"smith",  "AB1 2CD", "98sf"...
"jones",  "BC2 3DE", "as0j"...
"jones",  "BC2 3DE", "9as6"...

Edit

Thanks for the help. I think I have a working solution, but I would love to know whether it can be optimised (I'm sure it can!).

require 'csv'
require 'set'

set_one    = Set.new   # every surname/postcode pair seen at least once
set_two    = Set.new   # pairs seen more than once
duplicates = Array.new
headers    = nil

# First pass: find the pairs that occur more than once
CSV.foreach('customers.csv', :headers => true, :header_converters => :symbol) do |row|
  headers = row.headers unless headers
  # The :symbol converter turns the "postcode" header into :postcode
  values = [row[:surname], row[:postcode]]
  if set_one.include? values
    set_two << values
  else
    set_one << values
  end
end

# Second pass: collect every row whose pair is a known duplicate
CSV.foreach('customers.csv', :headers => true, :header_converters => :symbol) do |row|
  values = [row[:surname], row[:postcode]]
  if set_two.include? values
    duplicates << row
  end
end

CSV.open("duplicate-customers.csv", "wb") do |csv|
  csv << headers
  duplicates.each { |dupe| csv << dupe }
end

1 Answer:

Answer 0 (score: 3)

Let's read the CSV first (no handling of escaping or quotes, this is just an example):

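csv = []
columns = []
# File.foreach yields the file line by line (File.read does not take a block)
File.foreach('csv.file') do |line|
  fields = line.chomp.split(',')
  if columns.empty?
    # The first line holds the column names
    columns = fields
  else
    row_data = {}
    fields.each_with_index do |c, i|
      row_data[columns[i]] = c
    end
    csv << row_data
  end
end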
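Since the loop above skips quoting and escaping, here is a minimal sketch of building the same array of hashes with Ruby's standard csv library, which does handle quoted fields. The sample text and the helper name rows_from_csv are made up for illustration; they are not part of the original answer.

```ruby
require 'csv'

# Hypothetical sample standing in for the real file; note the quoted comma.
sample = <<~DATA
  "surname","postcode","otherstuff"
  "smith","AB1 2CD","has, a comma"
  "jones","BC2 3DE","plain"
DATA

# Parse with headers, then turn each CSV::Row into a plain Hash
# keyed by the header strings.
def rows_from_csv(text)
  CSV.parse(text, headers: true).map(&:to_h)
end

rows = rows_from_csv(sample)
rows.each { |row| p row }
```

A plain split(',') would cut "has, a comma" into two fields, which is the main reason to prefer the library for real customer data.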

OK, so how do we process the data? It looks like this:

[{'surname' => 'smith', 'postcode' => '1234', 'otherstuff' => 'xyz' },
 {'surname' => 'jones', 'postcode' => '1234', 'otherstuff' => 'xyz' },
 {'surname' => 'smith', 'postcode' => '2345', 'otherstuff' => 'xyz' },
 {'surname' => 'smith', 'postcode' => '1234', 'otherstuff' => 'xyz' }]

A first attempt could be:

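csv.select do |c|
  # Keep c only if some *other* row shares its surname and postcode;
  # without the equal? check every row would match itself
  csv.any? do |s|
    !s.equal?(c) && s['surname'].eql?(c['surname']) && s['postcode'].eql?(c['postcode'])
  end
end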

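OK, that compares every row with every other row, so it is slow and not very smart. Let's move on to solution 2: build a hash key out of the data we want to check for uniqueness: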

sneakyhash = {}
csv.each do |row|
  magic_string = [row['surname'], row['postcode']].join("--MaGiCaL--SpLiTTinG--StRiNG--")
  if sneakyhash[magic_string].nil?
    sneakyhash[magic_string] = 1
  else
    # Only the second and later occurrences of a key reach this branch
    puts "this guy looks suspicious: " + row.values.join(',')
  end
end
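The hash-key loop reports only the repeat occurrences, while the question asks for every row of a duplicated group, including the first one. One possible way to get that, sketched here with Enumerable#group_by rather than the answer's joined string, is to group rows by their key values and keep the groups with more than one member. The method name duplicate_rows and the sample rows are assumptions for illustration.

```ruby
# Sample rows in the same shape as the array of hashes shown earlier.
rows = [
  { 'surname' => 'smith', 'postcode' => '1234', 'otherstuff' => 'a' },
  { 'surname' => 'jones', 'postcode' => '1234', 'otherstuff' => 'b' },
  { 'surname' => 'smith', 'postcode' => '2345', 'otherstuff' => 'c' },
  { 'surname' => 'smith', 'postcode' => '1234', 'otherstuff' => 'd' }
]

# Group rows by the values under the chosen keys, keep only groups
# with more than one row, and flatten the groups back into one list.
def duplicate_rows(rows, keys)
  rows.group_by { |row| row.values_at(*keys) }
      .values
      .select { |group| group.size > 1 }
      .flatten(1)
end

dupes = duplicate_rows(rows, %w[surname postcode])
dupes.each { |row| p row }
```

Using an array of values as the grouping key avoids having to pick a separator string that can never occur in the data. The result comes back grouped by key, not in the original file order.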

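Far from optimal, but I'm just thinking out loud here. If this is a one-off and you only need to parse a single file, go with whatever you can come up with.

What you probably want to do is store this identifying string in an array or hash while you read the CSV, check whether the current row matches any row stored so far, and act on it if it does.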
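A rough sketch of that idea, assuming the standard csv library and a single pass over the file: the first row for each key is held back, and it is written out together with the current row as soon as the key shows up a second time. The method name write_duplicates, the file names and the sample data are hypothetical.

```ruby
require 'csv'
require 'tmpdir'

# Copy every row whose key (the values under key_headers) occurs more
# than once in input_path to output_path, reading the input only once.
def write_duplicates(input_path, output_path, key_headers)
  pending = {}  # key => first row seen with that key, not yet written
  emitted = {}  # key => true once the key is known to be duplicated
  CSV.open(output_path, 'w') do |out|
    header_written = false
    CSV.foreach(input_path, headers: true) do |row|
      unless header_written
        out << row.headers
        header_written = true
      end
      key = row.values_at(*key_headers)
      if emitted[key]
        out << row
      elsif pending.key?(key)
        out << pending.delete(key)  # the held-back first occurrence
        out << row
        emitted[key] = true
      else
        pending[key] = row
      end
    end
  end
end

# Demo on a throwaway file with made-up data.
result = Dir.mktmpdir do |dir|
  input  = File.join(dir, 'customers.csv')
  output = File.join(dir, 'duplicate-customers.csv')
  File.write(input, <<~DATA)
    surname,postcode,other
    smith,AB1 2CD,dxfh
    smith,AB1 2CD,98sf
    jones,BC2 3DE,as0j
    jones,BC2 3DE,9as6
    blogs,BC2 3DE,9as6
  DATA
  write_duplicates(input, output, %w[surname postcode])
  File.read(output)
end
puts result
```

This keeps at most one row per distinct key in memory, at the cost of writing rows in the order their duplication is discovered instead of strict file order.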