Optimizing a Ruby algorithm for grouping and counting colours

Time: 2010-02-10 17:31:42

Tags: ruby performance algorithm optimization

I have what looks on the surface like a simple problem that I'd like to solve in Ruby. I have a bunch of colours with associated photo IDs, e.g.

[[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

I want to process the data so that it ends up in the following format:

2 photos for red,green
3 photos for red
1 photo for yellow

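A few things to note:

  1. The photo(s) matching the most colours come first in the list. If the number of colours matched is the same (as with red and yellow above), the entry with the highest count comes first.

  2. The count for red is 3, because 2 photos have red and green, and a third photo has only red. I don't show a result for green on its own, because all green photos are already accounted for by the red,green entry.

  3. Ultimately I only need to display the top 5 results, regardless of how large the data set is.

  4. I have written an algorithm that achieves this (see below), but I would appreciate any guidance on how to make it faster and more elegant. Speed is the main concern, since I will be operating on a lot of data (on the order of a million rows). After that, if it can be made more elegant, that would be nice. I don't think I write elegant Ruby code, as I come from a C++ background.

    I'm aware that embedding C and C++ code in Ruby can improve performance, but I'd really like to achieve this using only Ruby.

    Many thanks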
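The counting rule in point 2 can be illustrated with a minimal sketch on the sample data from the top of the question. It assumes a colour combination counts towards every photo whose colour set contains it. The names `photos` and `count_for` are hypothetical and exist only for this illustration; this is not the implementation that follows.

```ruby
require 'set'

data = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

# Build one Set of colours per photo id.
photos = data.each_with_object(Hash.new { |h, k| h[k] = Set.new }) do |(id, colour), h|
  h[id] << colour
end

# Hypothetical helper: number of photos whose colours include every colour in `combo`.
count_for = lambda do |combo|
  wanted = combo.to_set
  photos.values.count { |colours| wanted.subset?(colours) }
end

puts count_for.call(%w[red green])  # photos 1 and 4
puts count_for.call(%w[red])        # photos 1, 2 and 4
puts count_for.call(%w[yellow])     # photo 3
```

Checking every candidate combination against every photo like this is quadratic, so it only serves to pin down the expected numbers, not as a fast solution.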

    beginning = Time.now
    
    ARR = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]
    
    # Group the colours by their id.
    groups = ARR.group_by {|x| x[0]}
    
    # output for profiling.
    puts "After Group BY: #{Time.now - beginning} seconds."
    
    # Remove the id's, as they are no longer useful. Sort the colours alphabetically.
    sorted_groups = []
    groups.each do |i,j|
      sorted_groups << j.map!{ |x|  x[1]}.sort
    end
    
    # Order the colours, so the group containing the most colours comes first.
    # Do a secondary sort alphabetically, so that all identical groups are next to each other. 
    sorted_groups_in_order = sorted_groups.sort_by { |s| [s.length,s] }.reverse
    
    # Traverse the groups in order to find the index that marks the position of results_to_return unique groups.
    # This is to make subsequent processing more efficient, as it will only operate on a smaller subset.
    results_to_return = 5
    temp = sorted_groups_in_order[0]
    combination_count = 0
    index = 0
    
    sorted_groups_in_order.each do |e|
     combination_count +=1 if e != temp
     break if combination_count == results_to_return
    
     index += 1
     temp = e
    end
    
    # Iterate through the subset, and count the duplicates.
    tags_with_count = Hash.new(0)
    sorted_groups_in_order[0..index].each do |v|
      tags_with_count.store(v,tags_with_count[v]+1)
    end
    
    # Sort by the number of colours in each subset, the most colours go first.
    tags_with_count = tags_with_count.sort { |q,w| w[0].size <=> q[0].size }
    
    # if colour subsets are found in colour supersets, then increment the subset count to reflect this.
    tags_with_count.reverse.each_with_index do |object,index|
      tags_with_count.reverse.each_with_index do |object2,index2|
        if (index2 < index) && (object[0]&object2[0] == object2[0])
          object2[1] += object[1]
        end
      end
    end
    
    # Sort by the number of colours in each subset, the most colours go first.
    # Perform a secondary sort by the count value.
    tags_with_count = tags_with_count.sort_by { |s| [s[0].length,s[1]] }.reverse
    
    # print our results.
    tags_with_count.each do |l|
      puts l.inspect
    end
    
    # output for profiling.
    puts "Time elapsed: #{Time.now - beginning} seconds."
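The timings in the code above come from `Time.now` differences on a ten-row sample, which is probably too small to say much about a data set with around a million rows. A minimal sketch of timing one stage on synthetic data with the standard `Benchmark` module follows. `synthetic_rows`, `COLOURS`, the row counts and the seed are all assumptions made up for illustration, and the generator may emit duplicate [id, colour] pairs that real data might not contain.

```ruby
require 'benchmark'

# Assumed palette, only used to generate test rows.
COLOURS = %w[red green yellow black blue white].freeze

# Hypothetical generator: `rows` random [photo_id, colour] pairs over `photo_count` ids.
def synthetic_rows(rows, photo_count, rng = Random.new(42))
  Array.new(rows) { [rng.rand(1..photo_count), COLOURS[rng.rand(COLOURS.size)]] }
end

rows = synthetic_rows(100_000, 30_000)

# Time only the grouping stage; other stages can be wrapped the same way.
groups = nil
elapsed = Benchmark.realtime do
  groups = rows.group_by { |id, _colour| id }
end

puts "group_by over #{rows.size} rows: #{'%.4f' % elapsed} seconds"
```

Timing each stage separately on data of realistic size should show which step dominates before any of them is rewritten.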
    

3 Answers:

Answer 0: (score: 1)

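See the new answer, which reflects the revised specification.

Assuming you have Ruby 1.8.7 or later, you can use Array#combination. Otherwise you will need to install the Ruby permutation gem:

http://permutation.rubyforge.org/

Then...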

data = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

 # get a hash mapping photo_id to colors
colors_by_photo_id = data.inject(Hash.new {|h,k| h[k] = []})  do |h,a| 
     h[a.first] << a.last
     h
end

 # could use inject here, but i think this is more readable
total_counts = Hash.new{|h,k| h[k] = 0}

 # add up the sum for all combinations
colors_by_photo_id.values.each do |color_array|
  1.upto(color_array.size).each do |i|
     color_array.combination(i){|comb| total_counts[comb.sort] += 1}
  end
end

>> total_counts
=> {["green", "red"]=>2, ["red"]=>3, ["yellow"]=>1, ["green"]=>2}

 # or if you want the output sorted:
>> total_counts.to_a.sort_by{|a,c| -c}
=> [[["red"], 3], [["green", "red"], 2], [["green"], 2], [["yellow"], 1]]
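One cost to keep in mind with this approach: a photo with n colours contributes 2^n - 1 non-empty combinations, so the number of hash updates grows quickly with the number of colours per photo. A minimal sketch that counts them, where `combination_total` is a hypothetical helper and not part of the answer's code:

```ruby
# Hypothetical helper: how many non-empty combinations one photo's colours produce.
def combination_total(colours)
  (1..colours.size).inject(0) { |sum, i| sum + colours.combination(i).count }
end

puts combination_total(%w[red green])               # 2**2 - 1
puts combination_total(%w[red green yellow black])  # 2**4 - 1
puts combination_total(('a'..'j').to_a)             # 2**10 - 1
```

With only a handful of colours per photo this is likely fine. With many colours per photo it may dominate the run time.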

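Answer 1: (score: 1)

I'm posting a new answer because you revised the specification.

This runs in 1/3 of the time on the small data set and gives the same output.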

beginning = Time.now
ARR = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]

#assemble an array of photos, each photo being an array of sorted colors
photos = ARR.inject(Hash.new {|h,k| h[k] = []})  do |h,a| 
     h[a.first] << a.last
     h
end.values.map{|v| v.sort!}

#count the occurrences of each combination
combination_counts = photos.uniq.inject(Hash.new(0)) {|h,comb| h[comb] = photos.count(comb); h}

#unique combinations
combinations = combination_counts.keys 

#find the 5 largest combinations
#(struck through in the original answer and superseded by the block below;
# left commented out because it would remove the largest combinations
# from `combinations` before the block below runs)
#top_5 = (1..[combinations.size,5].min).map do 
#          combinations.delete( combinations.max {|a,b| a.size <=> b.size} )
#        end

#find the top 5, plus extras in case of ties (this replaces the above stricken code)
top_set = []
until combinations.empty?
  next_photo = combinations.delete( combinations.max {|a,b| a.size <=> b.size} )
  # stop once we have 5 and the next combination is smaller than the last one kept
  break if top_set.size >= 5 && next_photo.size != top_set.last.size
  top_set << next_photo
end


#calculate frequency of the largest counts & sort
total_counts = top_set.inject(Hash.new {|h,k| h[k] = 0}) do |hash,combination|
  combination_counts.each{|k,v| hash[combination] += v if (combination & k) == combination}
  hash
end.sort_by { |s| [-1*s[0].length,-1*s[1]] }

total_counts[0..4].each do |l|
  puts l.inspect
end
# output for profiling.
puts "Time elapsed: #{Time.now - beginning} seconds."
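In the counting step above, `photos.count(comb)` rescans the whole `photos` array once per unique combination. A single pass with a default-0 hash should give the same counts. The following is a sketch on the sample data, not a measured improvement; the variable names `data` and `single_pass` are assumptions for illustration. On Ruby 2.7 or later, `photos.tally` should produce the same hash.

```ruby
data = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],
        [4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]

# One sorted colour array per photo.
photos = data.group_by { |id, _| id }.values.map do |pairs|
  pairs.map { |_, colour| colour }.sort
end

# Single pass: increment the counter for each photo's combination.
single_pass = photos.each_with_object(Hash.new(0)) { |comb, counts| counts[comb] += 1 }

single_pass.each { |comb, n| puts "#{n} x #{comb.join(',')}" }
```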

Answer 2: (score: 0)

Using group_by (requires Ruby 1.8.7+):
a = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]

groups = a .
  group_by {|e|e[0]} .
  collect do |id, photos|
    [id, photos.inject([]){|all,(id,colour)| all << colour}.sort.uniq]
  end .
  group_by {|e|e[1]}

groups.each {|colours, list| groups[colours] = list.length}
h = Hash.new {|h,k| h[k]=[0,0]}

groups.each do |colours, count|
  colours.each do |colour|
    h[colour][0] += 1  # how many times a colour appears
    h[colour][1] += count  # how many photos the colour appears in
  end
end

h.each do |colour, (n,total)|
  groups.update({[colour] => total}) if n > 1
end

groups.each {|colours, count| puts "#{count} photos for #{colours.join ','}"}

Output

2 photos for green,red
3 photos for red
1 photos for yellow
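The output above comes out in hash iteration order. To apply the ordering from the question (most colours first, then highest count) and keep only five rows, the `groups` hash could be sorted before printing. A minimal sketch, assuming `groups` maps a sorted colour array to its photo count as in the code above; the literal below simply reproduces that result, and `top` is a hypothetical name:

```ruby
# Stand-in for the `groups` hash built by the answer's code.
groups = { %w[green red] => 2, %w[red] => 3, %w[yellow] => 1 }

# Most colours first, then highest count; keep at most five entries.
top = groups.sort_by { |colours, count| [-colours.size, -count] }.first(5)

top.each { |colours, count| puts "#{count} photos for #{colours.join(',')}" }
```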