Question

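I have what looks like a simple problem on the surface, and I would like to solve it in Ruby. I have a bunch of colours with associated photo IDs, e.g.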

[[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

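I want to process the data so that it ends up in this format:

2 photos for red,green
3 photos for red
1 photo for yellow

A few things to note:

The photo or photos matching the most colours come first in the list; if the number of colours matched is the same (as with red and yellow above), the entry with the highest count comes first.
The count for red is 3, because 2 photos have red and green and a third photo has only red. I don't show a result for green on its own, because all the green photos are accounted for by the red and green entry.
In the end I only need to display the top 5 results, however large the data set is.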

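I have written an algorithm that achieves this (see below), but I would appreciate any guidance on how to make it faster and more elegant. Speed is the main consideration, as I will be working on a lot of data (on the order of a million); beyond that, if it can be made more elegant, that would be nice. I don't think I write elegant Ruby code, as I come from a C++ background.

I know I could embed C or C++ code in Ruby for the performance gain, but I would really like to achieve this using Ruby alone.

Many thanks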

beginning = Time.now

ARR = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]

# Group the colours by their id.
groups = ARR.group_by {|x| x[0]}

# output for profiling.
puts "After Group BY: #{Time.now - beginning} seconds."

# Remove the id's, as they are no longer useful. Sort the colours alphabetically.
sorted_groups = []
groups.each do |i,j|
  sorted_groups << j.map!{ |x|  x[1]}.sort
end

# Order the colours, so the group containing the most colours comes first.
# Do a secondary sort alphabetically, so that all identical groups are next to each other. 
sorted_groups_in_order = sorted_groups.sort_by { |s| [s.length,s] }.reverse

# Traverse the groups in order to find the index that marks the position of results_to_return unique groups.
# This is to make subsequent processing more efficient, as it will only operate on a smaller subset.
results_to_return = 5
temp = sorted_groups_in_order[0]
combination_count = 0
index = 0

sorted_groups_in_order.each do |e|
 combination_count +=1 if e != temp
 break if combination_count == results_to_return

 index += 1
 temp = e
end

# Iterate through the subset, and count the duplicates.
tags_with_count = Hash.new(0)
sorted_groups_in_order[0..index].each do |v|
  tags_with_count.store(v,tags_with_count[v]+1)
end

# Sort by the number of colours in each subset, the most colours go first.
tags_with_count = tags_with_count.sort { |q,w| w[0].size <=> q[0].size }

# if colour subsets are found in colour supersets, then increment the subset count to reflect this.
tags_with_count.reverse.each_with_index do |object,index|
  tags_with_count.reverse.each_with_index do |object2,index2|
    if (index2 < index) && (object[0]&object2[0] == object2[0])
      object2[1] += object[1]
    end
  end
end

# Sort by the number of colours in each subset, the most colours go first.
# Perform a secondary sort by the count value.
tags_with_count = tags_with_count.sort_by { |s| [s[0].length,s[1]] }.reverse

# print our results.
tags_with_count.each do |l|
  puts l.inspect
end

# output for profiling.
puts "Time elapsed: #{Time.now - beginning} seconds."
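
The timings above are taken by subtracting Time.now values by hand. As a hedged sketch of another way to time individual steps, the standard library's Benchmark module could be used instead. The variable names here (pairs, group_seconds, sort_seconds) are illustrative and not part of the original code:

```ruby
require 'benchmark'

# Sample data from the question: [photo id, colour].
pairs = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],
         [4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]

groups = nil
# Benchmark.realtime returns the wall-clock time the block took, in seconds.
group_seconds = Benchmark.realtime do
  groups = pairs.group_by { |id, _colour| id }
end

sorted_groups = nil
sort_seconds = Benchmark.realtime do
  # Drop the ids and sort each photo's colours alphabetically.
  sorted_groups = groups.values.map { |rows| rows.map { |_id, colour| colour }.sort }
end

puts "group_by: #{group_seconds} seconds"
puts "sort:     #{sort_seconds} seconds"
```

On ten rows the numbers are too small to mean much; timings only become informative on data closer to the real size.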

Answer 1

See the new answer, which reflects the revised spec.

Assuming you have Ruby 1.8.7 or later, you can use Array#combination. Otherwise you will need to install the Ruby permutation gem:

http://permutation.rubyforge.org/

Then:

data = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

 # get a hash mapping photo_id to colors
colors_by_photo_id = data.inject(Hash.new {|h,k| h[k] = []})  do |h,a| 
     h[a.first] << a.last
     h
end

 # could use inject here, but i think this is more readable
total_counts = Hash.new{|h,k| h[k] = 0}

 # add up the sum for all combinations
colors_by_photo_id.values.each do |color_array|
  1.upto(color_array.size).each do |i|
     color_array.combination(i){|comb| total_counts[comb.sort] += 1}
  end
end

>> total_counts
=> {["green", "red"]=>2, ["red"]=>3, ["yellow"]=>1, ["green"]=>2}

 # or if you want the output sorted:
>> total_counts.to_a.sort_by{|a,c| -c}
=> [[["red"], 3], [["green", "red"], 2], [["green"], 2], [["yellow"], 1]]
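
The hash above still contains ["green"], which the question does not want listed. A minimal sketch of one way to narrow it down follows. It assumes that only combinations which are some photo's complete colour set should be shown (this is how the question's own code behaves), and the names photo_sets, exact_sets and report are illustrative:

```ruby
data = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

# Each photo's complete colour set, sorted so that equal sets compare equal.
photo_sets = data.group_by(&:first).values.map { |rows| rows.map(&:last).sort }

# Count every non-empty combination of each photo's colours.
total_counts = Hash.new(0)
photo_sets.each do |colours|
  1.upto(colours.size) do |i|
    colours.combination(i) { |comb| total_counts[comb] += 1 }
  end
end

# Assumption: keep only combinations that are some photo's complete colour set.
exact_sets = photo_sets.uniq
report = total_counts.select { |comb, _count| exact_sets.include?(comb) }
                     .sort_by { |comb, count| [-comb.size, -count] }
                     .first(5)

report.each { |comb, count| puts "#{count} photos for #{comb.join(',')}" }
```

Note that enumerating every combination grows as 2^n for a photo with n colours, so this approach is only reasonable when each photo has a handful of colours.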

Answer 2

I am posting a new answer because you revised the spec.

It runs in about a third of the time on the small data set and gives the same output.

beginning = Time.now
ARR = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]

#assemble an array of photos, each photo being an array of sorted colors
photos = ARR.inject(Hash.new {|h,k| h[k] = []})  do |h,a| 
     h[a.first] << a.last
     h
end.values.map{|v| v.sort!}

#count the occurrences of each combination
combination_counts = photos.uniq.inject(Hash.new(0)) {|h,comb| h[comb] = photos.count(comb); h}

#unique combinations
combinations = combination_counts.keys 

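#find the 5 largest combinations
#(superseded by the tie-aware version below; left commented out because
# running it would empty `combinations` before the next step)
#top_5 = (1..[combinations.size,5].min).map do
#          combinations.delete( combinations.max {|a,b| a.size <=> b.size} )
#        end

#find the top 5, plus extras in case of ties (this replaces the code commented out above)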
top_set = []
next_photo = combinations.delete( combinations.max {|a,b| a.size <=> b.size} )
begin
  top_set << next_photo 
  last_photo = next_photo
  next_photo = combinations.delete( combinations.max {|a,b| a.size <=> b.size} ) unless combinations.empty?
end while !combinations.empty? && (top_set.size < 5 || next_photo.size == last_photo.size)


#calculate frequency of the largest counts & sort
total_counts = top_set.inject(Hash.new {|h,k| h[k] = 0}) do |hash,combination|
  combination_counts.each{|k,v| hash[combination] += v if (combination & k) == combination}
  hash
end.sort_by { |s| [-1*s[0].length,-1*s[1]] }

total_counts[0..4].each do |l|
  puts l.inspect
end
# output for profiling.
puts "Time elapsed: #{Time.now - beginning} seconds."
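
The counting step above calls photos.count once per unique combination, which rescans the whole list each time. As a hedged sketch (an alternative formulation, not the answer author's code), the same counts can be gathered in a single pass with a Hash.new(0) counter. The names pairs, by_size and cutoff are illustrative:

```ruby
pairs = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],
         [4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]

# One sorted colour array per photo.
photos = pairs.group_by(&:first).values.map { |rows| rows.map(&:last).sort }

# Single pass: count identical colour sets.
combination_counts = Hash.new(0)
photos.each { |colours| combination_counts[colours] += 1 }

# Keep the five combinations with the most colours, plus any that tie
# with the fifth one on size (everything, if there are fewer than five).
by_size = combination_counts.keys.sort_by { |colours| -colours.size }
cutoff  = by_size[4] ? by_size[4].size : 0
top_set = by_size.select { |colours| colours.size >= cutoff }

# For each kept combination, add up the photos whose colours include all of it.
total_counts = top_set.map do |combination|
  total = combination_counts.inject(0) do |sum, (colours, count)|
    (combination - colours).empty? ? sum + count : sum
  end
  [combination, total]
end.sort_by { |combination, total| [-combination.size, -total] }

total_counts.first(5).each { |row| puts row.inspect }
```

The superset check is still a nested loop over the unique combinations, so the saving is in the counting step only.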

Answer 3

This uses group_by, which requires Ruby 1.8.7+.

a = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

groups = a .
  group_by {|e|e[0]} .
  collect do |id, photos|
    [id, photos.inject([]){|all,(id,colour)| all << colour}.sort.uniq]
  end .
  group_by {|e|e[1]}

groups.each {|colours, list| groups[colours] = list.length}
h = Hash.new {|h,k| h[k]=[0,0]}

groups.each do |colours, count|
  colours.each do |colour|
    h[colour][0] += 1  # how many times a colour appears
    h[colour][1] += count  # how many photos the colour appears in
  end
end

h.each do |colour, (n,total)|
  groups.update({[colour] => total}) if n > 1
end

groups.each {|colours, count| puts "#{count} photos for #{colours.join ','}"}

Output

2 photos for green,red
3 photos for red
1 photos for yellow

Optimisation of a Ruby algorithm for grouping and counting colours

3 answers: