I need to process some CSV data and can't figure out a way to match duplicates.
The data looks a bit like this:
line id name item_1 item_2 item_3 item_4
1 251 john foo foo foo foo
2 251 john foo bar bar bar
3 251 john foo bar baz baz
4 251 john foo bar baz pat
In this case, lines 1-3 are duplicates.
line id name item_1 item_2 item_3 item_4
5 347 bill foo foo foo foo
6 347 bill foo bar bar bar
In this case, only line 5 is a duplicate.
line id name item_1 item_2 item_3 item_4
7 251 mary foo foo foo foo
8 251 mary foo bar bar bar
9 251 mary foo bar baz baz
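Here lines 7 and 8 are duplicates.
So basically, if the pattern adds a new "item", the previous line is a duplicate. I want to end up with one line per person, regardless of how many items they have.
I'm using Ruby 1.9.3: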
require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
CSV.open("output-file", "wb") do |csv|
  #write the first row (header) to the output file
  csv << people[0]
  people.each do |p|
    ... logic to test for dupe ...
    csv << p.unique
  end
end
Answer 0: (score: 3)
First, your code has a minor bug. Instead of:
csv << people[0]
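you need to do the following if you don't want to change your loop code, because otherwise the header row is written twice: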
csv << people.shift
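Now, the following solution adds only the first occurrence of each person and discards any subsequent duplicates, as determined by the id (since I'm assuming the id is unique).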
require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
ids = [] # or you could use a Set
CSV.open("output-file", "wb") do |csv|
  # Write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    # p[0] is assumed to be the id column; adjust the index if your file
    # has other columns before it.
    # If the id of the current record is in the ids array, we've already
    # seen this person
    next if ids.include?(p[0])
    # Add the new id to the front of the ids array. In the example you gave,
    # duplicate records directly follow the original, so this is slightly
    # faster than adding the id to the end; above we still check the entire
    # array to be safe
    ids.unshift p[0]
    csv << p
  end
end
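The comment above mentions that a Set could replace the array. Here is a minimal sketch of that variant, assuming the rows are already in memory as an array of arrays (the shape CSV.read returns once the header is shifted off). The method name first_row_per_id and its id_index parameter are made up for illustration.

```ruby
require 'set'

# Keep only the first row seen for each id.
# `rows` is an array of arrays without the header row;
# `id_index` (hypothetical parameter) is the position of the id column.
def first_row_per_id(rows, id_index = 0)
  seen = Set.new
  # Set#add? returns nil when the element is already present,
  # so select keeps a row only the first time its id appears.
  rows.select { |row| seen.add?(row[id_index]) }
end

rows = [
  ['251', 'john', 'foo', 'foo'],
  ['251', 'john', 'foo', 'bar'],
  ['347', 'bill', 'foo', 'foo']
]
deduped = first_row_per_id(rows)
deduped.each { |row| puts row.join(',') }
```

A Set lookup does not scan every stored id, so this should scale better than Array#include? on large files; for small inputs either is fine.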
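Note that if your duplicate records always directly follow the original, there is a more efficient solution: you only need to keep the last original id and compare the current record's id against it, rather than checking for inclusion in the whole array. If your input file doesn't contain many records, the difference is negligible.
It would look like this: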
require 'csv'
puts "loading data"
people = CSV.read('input-file.csv')
previous_id = nil
CSV.open("output-file", "wb") do |csv|
  # Write the first row (header) to the output file
  csv << people.shift
  people.each do |p|
    next if p[0] == previous_id
    previous_id = p[0]
    csv << p
  end
end
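Both versions above keep the first row for each id. In the question's sample data it is the earlier rows that are called duplicates, and id 251 appears under two different names, so a variant that keeps the last row per (id, name) pair may be closer to what is wanted. The following is only a sketch under those assumptions; last_row_per_person and key_columns are illustrative names, and the rows are assumed to start with the id and name columns.

```ruby
# Keep the last row seen for each person.
# `rows` is an array of arrays without the header row;
# `key_columns` (hypothetical parameter) lists the column indexes
# that together identify a person.
def last_row_per_person(rows, key_columns = [0, 1])
  latest = {}
  # Later rows overwrite earlier ones with the same key. Ruby hashes
  # (1.9 and later) keep the position of the key's first insertion.
  rows.each { |row| latest[row.values_at(*key_columns)] = row }
  latest.values
end

rows = [
  ['251', 'john', 'foo', 'foo', 'foo', 'foo'],
  ['251', 'john', 'foo', 'bar', 'baz', 'pat'],
  ['347', 'bill', 'foo', 'bar', 'bar', 'bar'],
  ['251', 'mary', 'foo', 'bar', 'baz', 'baz']
]
kept = last_row_per_person(rows)
kept.each { |row| puts row.join(',') }
```

This does not require duplicates to be adjacent, at the cost of holding one row per person in memory.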
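Answer 1: (score: 1)
It sounds like you're trying to get the list of unique items associated with each person, where a person is identified by an id and a name. If that's right, you can do something like this: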
peoplehash = {}
maxitems = 0
people.each do |id, name, *items|
  # Append this row's items to the list for the (id, name) pair
  (peoplehash[[id, name]] ||= []).concat(items)
end
peoplehash.keys.each do |k|
  peoplehash[k].uniq!
  peoplehash[k].sort!
  maxitems = [maxitems, peoplehash[k].size].max
end
This will give you a structure like this:
{
  ["251", "john"] => ["bar", "baz", "foo", "pat"],
  ["347", "bill"] => ["bar", "foo"]
}
and maxitems tells you the length of the longest items array, which you can then use for whatever you need.
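One possible use of maxitems, sketched here as an assumption about what the output should look like, is to pad every person's item list to the same width so each output row has the same number of columns. The sample hash below is illustrative data in the shape shown above, and the item_N header names follow the question's sample.

```ruby
# Illustrative data in the shape produced above.
peoplehash = {
  ['251', 'john'] => ['bar', 'baz', 'foo', 'pat'],
  ['347', 'bill'] => ['bar', 'foo']
}
maxitems = peoplehash.values.map(&:size).max

# Header with one item_N column per possible item.
header = ['id', 'name'] + (1..maxitems).map { |i| "item_#{i}" }

# One row per person, padded with nil so every row has
# 2 + maxitems columns.
rows = peoplehash.map do |(id, name), items|
  [id, name] + items + [nil] * (maxitems - items.size)
end

puts header.join(',')
rows.each { |row| puts row.join(',') }
```

Each element of rows could be appended to a CSV writer with csv << row in the same way as in the other answers.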
Answer 2: (score: 0)
You can use 'uniq':
irb(main):009:0> row= ['ruby', 'rails', 'gem', 'ruby']
irb(main):010:0> row.uniq
=> ["ruby", "rails", "gem"]
or
row.uniq!
=> ["ruby", "rails", "gem"]
irb(main):017:0> row
=> ["ruby", "rails", "gem"]
irb(main):018:0> row = [1, 251, 'john', 'foo', 'foo', 'foo', 'foo']
=> [1, 251, "john", "foo", "foo", "foo", "foo"]
irb(main):019:0> row.uniq
=> [1, 251, "john", "foo"]
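Calling uniq on the whole row also compares the leading columns with the items, so an item that happened to equal the name or id would be dropped. A small sketch that deduplicates only the item columns follows; uniq_items and fixed_columns are hypothetical names, and the default of 3 assumes the line, id and name columns come first as in the sample.

```ruby
# Deduplicate only the item columns of a row, leaving the
# leading columns (line, id, name) untouched.
# `fixed_columns` (hypothetical parameter) is how many leading
# columns to preserve as-is.
def uniq_items(row, fixed_columns = 3)
  row.first(fixed_columns) + row.drop(fixed_columns).uniq
end

row = [2, 251, 'john', 'foo', 'bar', 'bar', 'bar']
result = uniq_items(row)
p result
```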