Ruby CSV duplicate row parsing

Time: 2012-03-07 13:24:21

Tags: ruby parsing csv ruby-1.9

I need to process some CSV data and can't find a way to match the duplicates.

The data looks a bit like this:

line    id    name   item_1    item_2    item_3    item_4
1      251   john    foo       foo       foo       foo
2      251   john    foo       bar       bar       bar
3      251   john    foo       bar       baz       baz
4      251   john    foo       bar       baz       pat

In this case, rows 1-3 are duplicates.

line    id    name   item_1    item_2    item_3    item_4
5      347   bill    foo       foo       foo       foo
6      347   bill    foo       bar       bar       bar

In this case, only row 5 is a duplicate.

line    id    name   item_1    item_2    item_3    item_4
7      251   mary    foo       foo       foo       foo
8      251   mary    foo       bar       bar       bar
9      251   mary    foo       bar       baz       baz

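Here rows 7 and 8 are duplicates.

So basically, whenever a row adds a new "item" to the pattern, the previous row is a duplicate. I want to end up with one row per person, no matter how many items they have.

I am using Ruby 1.9.3: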

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')

CSV.open("output-file", "wb") do |csv|
    #write the first row (header) to the output file
    csv << people[0]
    people.each do |p|
        ... logic to test for dupe ...
        csv << p.unique
    end
end

3 Answers:

Answer 0: (score: 3)

First, your code has a minor bug. Instead of:

csv << people[0]

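you need to do the following if you don't want to change your loop code. `people.shift` removes the header from the array, so the loop does not process it again as a data row: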

csv << people.shift

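Now, the following solution adds only the first occurrence of each person and discards any later duplicates, as determined by id (I am assuming the id is unique and is the first column).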

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
ids = [] # or you could use a Set

CSV.open("output-file", "wb") do |csv|
  #write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    # If the id of the current record is already in the ids array, we've
    # already seen this person
    next if ids.include?(p[0])

    # Add the new id to the front of the ids array. In the example you gave,
    # the duplicate records directly follow the original, so this makes the
    # lookup slightly faster than appending the id to the end. The check
    # above still scans the entire array to be safe
    ids.unshift p[0]
    csv << p
  end
end
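
The comment above mentions that a Set could replace the ids array. A minimal sketch of that variant follows, assuming the id is the first column; the inline sample data is hypothetical and stands in for input-file.csv, and the names `seen` and `kept` are only illustrative.

```ruby
require 'csv'
require 'set'

# Hypothetical sample standing in for input-file.csv.
# Assumption: the id is the first column, as in the answer's code.
data = <<-EOS
id,name,item_1,item_2
251,john,foo,foo
251,john,foo,bar
347,bill,foo,foo
251,john,foo,baz
EOS

rows = CSV.parse(data)
header = rows.shift
seen = Set.new

# Set#add? returns nil when the id is already in the set, so select keeps
# only the first row for each id, wherever the later duplicates appear.
kept = rows.select { |row| seen.add?(row[0]) }

output = CSV.generate do |csv|
  csv << header
  kept.each { |row| csv << row }
end
puts output
```

A Set lookup does not scan every stored id, so this should scale better than `Array#include?` when the file holds many distinct ids.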

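Note that if your duplicate records always directly follow the original, there is a more performant solution: keep only the id of the last original record and compare the current record's id against it, instead of searching the entire array. If your input file doesn't contain many records, the difference is negligible.

It would look like this: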

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
previous_id = nil

CSV.open("output-file", "wb") do |csv|
  #write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    next if p[0] == previous_id
    previous_id = p[0]
    csv << p
  end
end
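
Both versions above keep the first row seen for an id. In the question's sample, the row that survives is the last one for each person, and id 251 is shared by john and mary. A sketch under those assumptions keys on the (id, name) pair and lets later rows overwrite earlier ones. The inline data is the question's sample without its `line` column, and `last_row_for` is a hypothetical name.

```ruby
require 'csv'

# The question's sample, minus the "line" column (an assumption about the
# real file layout): id first, name second, items after that.
data = <<-EOS
id,name,item_1,item_2,item_3,item_4
251,john,foo,foo,foo,foo
251,john,foo,bar,bar,bar
251,john,foo,bar,baz,baz
251,john,foo,bar,baz,pat
347,bill,foo,foo,foo,foo
347,bill,foo,bar,bar,bar
251,mary,foo,foo,foo,foo
251,mary,foo,bar,bar,bar
251,mary,foo,bar,baz,baz
EOS

rows = CSV.parse(data)
header = rows.shift

# Each later row for the same (id, name) pair replaces the earlier one.
# Ruby 1.9+ hashes keep insertion order, so people stay in file order.
last_row_for = {}
rows.each do |row|
  last_row_for[[row[0], row[1]]] = row
end
deduped = last_row_for.values

output = CSV.generate do |csv|
  csv << header
  deduped.each { |row| csv << row }
end
puts output
```

This relies on the last row for a person being the most complete one, which holds for the sample data but is not guaranteed for arbitrary input.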

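Answer 1: (score: 1)

It sounds like you are trying to get the list of unique items associated with each person, where a person is identified by an id and a name. If that's right, you can do this: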

peoplehash = {}
maxitems = 0
people.each do |id, name, *items|
    # Collect every item seen for this (id, name) pair
    (peoplehash[[id, name]] ||= []).concat(items)
end
peoplehash.keys.each do |k|
    peoplehash[k].uniq!
    peoplehash[k].sort!
    maxitems = [maxitems, peoplehash[k].size].max
end

This will give you a structure like the following:

{
    [251, "john"] => ["bar", "baz", "foo", "pat"],
    [347, "bill"] => ["bar", "foo"]
}

maxitems tells you the length of the longest items array, which you can then use for whatever you need.
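
One possible use of maxitems is to write the grouped data back out with a fixed number of item columns. The sketch below assumes a hash shaped like the one above (with string ids, since CSV.read does not convert fields by default); the `item_N` header names and the nil padding are illustrative choices, not part of the original answer.

```ruby
require 'csv'

# Hypothetical result of the grouping step; ids are strings here because
# CSV.read returns every field as a string unless converters are given.
peoplehash = {
  ['251', 'john'] => %w[bar baz foo pat],
  ['347', 'bill'] => %w[bar foo]
}
maxitems = peoplehash.values.map(&:size).max

output = CSV.generate do |csv|
  # Header with one item_N column per slot in the longest items array.
  csv << ['id', 'name'] + (1..maxitems).map { |i| "item_#{i}" }
  peoplehash.each do |(id, name), items|
    # Pad shorter item lists with nil so every row has the same width.
    csv << [id, name] + items + [nil] * (maxitems - items.size)
  end
end
puts output
```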

Answer 2: (score: 0)

You can use 'uniq':

irb(main):009:0> row= ['ruby', 'rails', 'gem', 'ruby']
irb(main):010:0> row.uniq
=> ["ruby", "rails", "gem"]
or 

row.uniq!
=> ["ruby", "rails", "gem"]

irb(main):017:0> row
=> ["ruby", "rails", "gem"]

irb(main):018:0> row = [1,      251,   'john',    'foo',       'foo',       'foo',       'foo']
=> [1, 251, "john", "foo", "foo", "foo", "foo"]
irb(main):019:0> row.uniq
=> [1, 251, "john", "foo"]
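
Note that `uniq` removes repeated values within a single row, not repeated rows, and calling it on the whole row also compares the id and name fields against the items. A small sketch that limits it to the item columns, assuming the id and name come first; the sample row is hypothetical and deliberately contains an item equal to the name.

```ruby
# Hypothetical row: id, name, then items. One item happens to equal the name.
row = ['251', 'john', 'foo', 'bar', 'foo', 'john']

# Split off the id and name, then de-duplicate only the items.
id, name, *items = row
cleaned = [id, name, *items.uniq]

# Calling uniq on the whole row would also drop the item that matches the name.
whole_row = row.uniq
```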