Question

I need to process some CSV data and can't work out how to match the duplicates.

The data looks a bit like this:

line    id    name   item_1    item_2    item_3    item_4
1      251   john    foo       foo       foo       foo
2      251   john    foo       bar       bar       bar
3      251   john    foo       bar       baz       baz
4      251   john    foo       bar       baz       pat

In this case, rows 1-3 are duplicates.

line    id    name   item_1    item_2    item_3    item_4
5      347   bill    foo       foo       foo       foo
6      347   bill    foo       bar       bar       bar

In this case, only row 5 is a duplicate.

line    id    name   item_1    item_2    item_3    item_4
7      251   mary    foo       foo       foo       foo
8      251   mary    foo       bar       bar       bar
9      251   mary    foo       bar       baz       baz

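Here rows 7 and 8 are duplicates.

So basically, whenever the pattern adds a new "item", the previous row is a duplicate. I want to end up with one row per person, no matter how many items they have.

I'm using Ruby 1.9.3: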

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')

CSV.open("output-file", "wb") do |csv|
    #write the first row (header) to the output file
    csv << people[0]
    people.each do |p|
        ... logic to test for dupe ...
        csv << p.unique
    end
end

Answer 1

First, your code has a slight bug. Instead of:

csv << people[0]

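if you don't want to change your loop code, you need to do this, so the header row is removed from the array before the loop runs: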

csv << people.shift

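Now, the following solution adds only the first occurrence of each person and discards any subsequent duplicates as determined by id (I'm assuming the ids are unique).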

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
ids = [] # or you could use a Set

CSV.open("output-file", "wb") do |csv|
  #write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    # If the id of the current record is already in the ids array, we've
    # already seen this person
    next if ids.include?(p[0])

    # Add the new id to the front of the ids array. In the example you gave,
    # the duplicate records directly follow the original, so this will be
    # slightly faster than appending the id to the end; the check above still
    # scans the entire array to be safe
    ids.unshift p[0]
    csv << p
  end
end
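
As the comment on ids suggests, a Set avoids scanning an array on every row. A minimal sketch of that variant, assuming the id sits in the first column as in the loop above; first_rows_by_id is a hypothetical helper name and the sample rows are made up for illustration:

```ruby
require 'set'

# Hypothetical helper: keeps the first row seen for each id.
# Assumes the id is in column 0, as in the loop above.
def first_rows_by_id(rows)
  seen = Set.new
  # Set#add? returns nil when the id is already present, so select
  # keeps only the first row for each id.
  rows.select { |row| seen.add?(row[0]) }
end

rows = [
  ['251', 'john', 'foo', 'foo'],
  ['251', 'john', 'foo', 'bar'],
  ['347', 'bill', 'foo', 'foo']
]
first_rows_by_id(rows).each { |row| puts row.join(',') }
```

The rows would still be read with CSV.read and written with csv << row as before; only the duplicate check changes.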

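Note that if your duplicate records always directly follow the original, there is a more performant solution: you only need to keep the last original id and compare the current record's id against it, rather than checking for inclusion in a whole array. If your input file doesn't contain many records, the difference is negligible.

That would look like this: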

require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
previous_id = nil

CSV.open("output-file", "wb") do |csv|
  #write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    next if p[0] == previous_id
    previous_id = p[0]
    csv << p
  end
end
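
The loops above keep the first row of each person. The question describes the earlier rows as the duplicates, so if the row with the most items is the one that should survive, one option is to group consecutive rows and keep the last of each group. A rough sketch, assuming the id is in column 0 and the name in column 1 (adjust the indexes if the file has a leading line-number column); last_row_per_person is a hypothetical name:

```ruby
# Hypothetical helper: for each run of consecutive rows that share the
# same id and name, keep only the final row of the run.
# Assumes id is in column 0 and name in column 1.
def last_row_per_person(rows)
  rows.chunk { |row| [row[0], row[1]] }.map { |_key, group| group.last }
end

rows = [
  ['251', 'john', 'foo', 'foo'],
  ['251', 'john', 'foo', 'bar'],
  ['347', 'bill', 'foo', 'foo'],
  ['251', 'mary', 'foo', 'foo'],
  ['251', 'mary', 'foo', 'bar']
]
last_row_per_person(rows).each { |row| puts row.join(',') }
```

Grouping on both id and name keeps john and mary apart even though they share id 251 in the sample data. Like the previous_id version, this only works when a person's rows are adjacent.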

Answer 2

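It sounds like you're trying to get a list of the unique items associated with each person, where a person is identified by an id and a name. If that's right, you could do this: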

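peoplehash = {}
maxitems = 0
# Collect every item under its [id, name] key
people.each do |id, name, *items|
    (peoplehash[[id, name]] ||= []).concat(items)
end
# Deduplicate and sort each person's items, tracking the longest list
peoplehash.keys.each do |k|
    peoplehash[k].uniq!
    peoplehash[k].sort!
    maxitems = [maxitems, peoplehash[k].size].max
end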

This will give you a structure like:

{
    [251, "john"] => ["bar", "baz", "foo", "pat"],
    [347, "bill"] => ["bar", "foo"]
}

and maxitems tells you the length of the longest items array, which you can then use for whatever you need.
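
One possible use of maxitems is padding every person's row to the same width when writing the result back out. A sketch under that assumption; the peoplehash literal is sample data in the shape shown above, and padded_rows is a hypothetical helper:

```ruby
require 'csv'

# Hypothetical helper: turns { [id, name] => items } into flat rows,
# padding shorter item lists with nil so every row has the same width.
def padded_rows(peoplehash, maxitems)
  peoplehash.map do |(id, name), items|
    [id, name] + items + [nil] * (maxitems - items.size)
  end
end

# Sample data for illustration only
peoplehash = {
  ['251', 'john'] => ['bar', 'baz', 'foo', 'pat'],
  ['347', 'bill'] => ['bar', 'foo']
}
maxitems = peoplehash.values.map(&:size).max

# CSV.generate builds the CSV text in memory; nil cells become empty fields
output = CSV.generate do |csv|
  padded_rows(peoplehash, maxitems).each { |row| csv << row }
end
puts output
```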

Answer 3

You can use 'uniq':

irb(main):009:0> row= ['ruby', 'rails', 'gem', 'ruby']
irb(main):010:0> row.uniq
=> ["ruby", "rails", "gem"]
or 

row.uniq!
=> ["ruby", "rails", "gem"]

irb(main):017:0> row
=> ["ruby", "rails", "gem"]

irb(main):018:0> row = [1,      251,   'john',    'foo',       'foo',       'foo',       'foo']
=> [1, 251, "john", "foo", "foo", "foo", "foo"]
irb(main):019:0> row.uniq
=> [1, 251, "john", "foo"]
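
Note that calling uniq on a single row only collapses repeated cells within that row. To drop duplicate rows, uniq can also be given a block and compare rows by a key. A sketch, assuming each row is laid out as [line, id, name, items...] like the last irb example; unique_people is a hypothetical name and the rows are illustrative:

```ruby
# Hypothetical helper: keeps the first row for each (id, name) pair.
# Assumes id is in column 1 and name in column 2.
def unique_people(rows)
  rows.uniq { |row| [row[1], row[2]] }
end

rows = [
  [1, 251, 'john', 'foo', 'foo'],
  [2, 251, 'john', 'foo', 'bar'],
  [5, 347, 'bill', 'foo', 'foo']
]
unique_people(rows).each { |row| p row }
```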

Ruby CSV duplicate row parsing
