On the surface this looks like a simple problem, and I'd like to solve it in Ruby. I have a set of colours, each with an associated photo ID, e.g.
[[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]
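I'd like to process the data so that it ends up in this format:
2 photos for red,green
3 photos for red
1 photo for yellow
A few things to note:
The photo or photos that match the most colours come first in the list. If the number of colours matched is the same (as with red and yellow above), the entry with the highest count comes first.
The count for red is 3 because 2 photos have red and green and a third photo has only red. I don't display a result for green on its own, because all the green photos are already accounted for by the red and green entry.
Ultimately, I only ever need to display the top 5 results, regardless of how large the data set is.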
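The rules above can be sketched compactly as follows. This is only one reading of the requirements, not the asker's code: it assumes that a colour combination is listed only if at least one photo has exactly that combination, and that its count also includes every photo whose colours are a superset of it. The method name `top_colour_groups` and its `limit` parameter are made up for illustration.

```ruby
# Hypothetical helper: returns up to `limit` [combination, count] pairs,
# ordered by number of colours (descending), then by count (descending).
def top_colour_groups(pairs, limit = 5)
  # photo id => list of colours seen for that photo
  colours_by_photo = Hash.new { |h, k| h[k] = [] }
  pairs.each { |id, colour| colours_by_photo[id] << colour }

  # exact colour combination => number of photos with exactly that combination
  exact = Hash.new(0)
  colours_by_photo.each_value { |colours| exact[colours.uniq.sort] += 1 }

  # A combination also matches every photo whose colours are a superset of it.
  totals = exact.keys.map do |combo|
    count = exact.sum { |other, n| (combo - other).empty? ? n : 0 }
    [combo, count]
  end

  totals.sort_by { |combo, count| [-combo.size, -count] }.first(limit)
end

example = [[1, "red"], [1, "green"], [2, "red"], [3, "yellow"], [4, "green"], [4, "red"]]
p top_colour_groups(example)
# => [[["green", "red"], 2], [["red"], 3], [["yellow"], 1]]
```

The superset pass compares every distinct combination with every other, so its cost grows with the square of the number of distinct combinations rather than with the number of photos.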
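I've written an algorithm that achieves this (see below), but I'd appreciate any guidance on making it faster and more elegant. Speed is the primary concern, as I'll be operating on a lot of data (on the order of a million). After that, it would be nice if it could be made more elegant. I don't think I write elegant Ruby code, as I come from a C++ background.
I'm aware that embedding C and C++ code in Ruby can improve performance, but I'd really like to achieve this using only Ruby.
Many thanks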
beginning = Time.now
ARR = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]
# Group the colours by their id.
groups = ARR.group_by {|x| x[0]}
# output for profiling.
puts "After Group BY: #{Time.now - beginning} seconds."
# Remove the id's, as they are no longer useful. Sort the colours alphabetically.
sorted_groups = []
groups.each do |i,j|
  sorted_groups << j.map!{ |x| x[1]}.sort
end
# Order the colours, so the group containing the most colours comes first.
# Do a secondary sort alphabetically, so that all identical groups are next to each other.
sorted_groups_in_order = sorted_groups.sort_by { |s| [s.length,s] }.reverse
# Traverse the groups in order to find the index that marks the position of results_to_return unique groups.
# This is to make subsequent processing more efficient, as it will only operate on a smaller subset.
results_to_return = 5
temp = sorted_groups_in_order[0]
combination_count = 0
index = 0
sorted_groups_in_order.each do |e|
  combination_count +=1 if e != temp
  break if combination_count == results_to_return
  index += 1
  temp = e
end
# Iterate through the subset, and count the duplicates.
tags_with_count = Hash.new(0)
sorted_groups_in_order[0..index].each do |v|
  tags_with_count.store(v,tags_with_count[v]+1)
end
# Sort by the number of colours in each subset, the most colours go first.
tags_with_count = tags_with_count.sort { |q,w| w[0].size <=> q[0].size }
# if colour subsets are found in colour supersets, then increment the subset count to reflect this.
tags_with_count.reverse.each_with_index do |object,index|
  tags_with_count.reverse.each_with_index do |object2,index2|
    if (index2 < index) && (object[0]&object2[0] == object2[0])
      object2[1] += object[1]
    end
  end
end
# Sort by the number of colours in each subset, the most colours go first.
# Perform a secondary sort by the count value.
tags_with_count = tags_with_count.sort_by { |s| [s[0].length,s[1]] }.reverse
# print our results.
tags_with_count.each do |l|
  puts l.inspect
end
# output for profiling.
puts "Time elapsed: #{Time.now - beginning} seconds."
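Since speed is the main concern, it may help to time individual steps on a larger input than the ten-row sample. The following is a minimal sketch using Ruby's standard `Benchmark` module. The generator `synthetic_pairs`, the `PALETTE` constant, and the data shape (1 to 3 colours per photo) are assumptions made up for illustration and are not part of the question.

```ruby
require 'benchmark'

# Hypothetical palette and generator, only used to produce test data.
PALETTE = %w[red green yellow black blue].freeze

# Builds [id, colour] pairs for `photo_count` photos, 1 to 3 colours each.
def synthetic_pairs(photo_count, rng = Random.new(42))
  (1..photo_count).flat_map do |id|
    PALETTE.sample(rng.rand(1..3), random: rng).map { |colour| [id, colour] }
  end
end

pairs = synthetic_pairs(10_000)

# Time only the grouping and per-photo sorting step.
elapsed = Benchmark.realtime do
  pairs.group_by(&:first).map { |_id, rows| rows.map(&:last).sort }
end

puts format('group + sort: %.4f seconds', elapsed)
```

Timing each stage separately on realistic volumes shows which step dominates before any of them is rewritten.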
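Answer 0 (score: 1)
See the new answer, which reflects the revised specification.
Assuming you have Ruby > 1.8.7, you can use Array#combination. Otherwise you'll need to install the Ruby permutation gem:
http://permutation.rubyforge.org/
Then...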
data = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]
# get a hash mapping photo_id to colors
colors_by_photo_id = data.inject(Hash.new {|h,k| h[k] = []}) do |h,a|
  h[a.first] << a.last
  h
end
# could use inject here, but i think this is more readable
total_counts = Hash.new{|h,k| h[k] = 0}
# add up the sum for all combinations
colors_by_photo_id.values.each do |color_array|
  1.upto(color_array.size).each do |i|
    color_array.combination(i){|comb| total_counts[comb.sort] += 1}
  end
end
>> total_counts
=> {["green", "red"]=>2, ["red"]=>3, ["yellow"]=>1, ["green"]=>2}
# or if you want the output sorted:
>> total_counts.to_a.sort_by{|a,c| -c}
=> [[["red"], 3], [["green", "red"], 2], [["green"], 2], [["yellow"], 1]]
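To get from these counts to the ordering and five-row limit the question asks for, the hash can be sorted by combination size and then by count. The sketch below repeats the counting step so that it runs on its own. Unlike the question's expected output, it still lists `["green"]` separately, because every subset of a photo's colours is counted. The variable name `top` is an illustrative choice.

```ruby
data = [[1, "red"], [1, "green"], [2, "red"], [3, "yellow"], [4, "green"], [4, "red"]]

# photo id => colours
colors_by_photo_id = Hash.new { |h, k| h[k] = [] }
data.each { |id, color| colors_by_photo_id[id] << color }

# Count every non-empty subset of each photo's colours.
total_counts = Hash.new(0)
colors_by_photo_id.each_value do |colors|
  1.upto(colors.size) do |i|
    colors.combination(i) { |comb| total_counts[comb.sort] += 1 }
  end
end

# Most colours first, then highest count; keep at most five rows.
top = total_counts.sort_by { |comb, count| [-comb.size, -count] }.first(5)
top.each { |comb, count| puts "#{count} photos for #{comb.join(',')}" }
```

A photo with n colours contributes 2^n - 1 subsets, so this approach suits photos with only a handful of colours each.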
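Answer 1 (score: 1)
I'm posting a new answer because you revised the specification.
On the small data set it runs in about 1/3 of the time and gives the same output.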
beginning = Time.now
ARR = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"red"],[4,"green"],[4,"yellow"],[5,"green"],[5,"red"],[6,"black"]]
#assemble an array of photos, each photo being an array of sorted colors
photos = ARR.inject(Hash.new {|h,k| h[k] = []}) do |h,a|
  h[a.first] << a.last
  h
end.values.map{|v| v.sort!}
#count the occurrences of each combination
combination_counts = photos.uniq.inject(Hash.new(0)) {|h,comb| h[comb] = photos.count(comb); h}
#unique combinations
combinations = combination_counts.keys
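# The next four lines were struck through in the answer. They are left commented out
# because they delete entries from `combinations` that the replacement below needs.
# #find the 5 largest combinations
# top_5 = (1..[combinations.size,5].min).map do
#   combinations.delete( combinations.max {|a,b| a.size <=> b.size} )
# end
#find the top 5, plus extras in case of ties (this replaces the stricken code above)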
top_set = []
next_photo = combinations.delete( combinations.max {|a,b| a.size <=> b.size} )
begin
  top_set << next_photo
  last_photo = next_photo
  next_photo = combinations.delete( combinations.max {|a,b| a.size <=> b.size} ) unless combinations.empty?
end while !combinations.empty? && (top_set.size < 5 || next_photo.size == last_photo.size)
#calculate frequency of the largest counts & sort
total_counts = top_set.inject(Hash.new {|h,k| h[k] = 0}) do |hash,combination|
  combination_counts.each{|k,v| hash[combination] += v if (combination & k) == combination}
  hash
end.sort_by { |s| [-1*s[0].length,-1*s[1]] }
total_counts[0..4].each do |l|
  puts l.inspect
end
# output for profiling.
puts "Time elapsed: #{Time.now - beginning} seconds."
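One step that could matter on large inputs is the line that builds `combination_counts`: it calls `photos.count(comb)` once per unique combination, which rescans the whole `photos` array each time. A minimal sketch of a single-pass alternative follows. The `photos` literal is the sorted per-photo colour list that the sample `ARR` produces, and `single_pass` is a name chosen for illustration.

```ruby
# Sorted colour lists per photo, as produced from the sample ARR above.
photos = [%w[green red], %w[red], %w[yellow], %w[green red yellow], %w[green red], %w[black]]

# One pass over the photos: combination => number of photos with exactly that combination.
single_pass = Hash.new(0)
photos.each { |combination| single_pass[combination] += 1 }

# The expression used in the answer, for comparison.
rescanning = photos.uniq.inject(Hash.new(0)) { |h, comb| h[comb] = photos.count(comb); h }

puts single_pass == rescanning
# => true
```

On Ruby 2.7 or later, `photos.tally` should produce the same hash in one call.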
Answer 2 (score: 0)
Using group_by:
a = [[1,"red"],[1,"green"],[2,"red"],[3,"yellow"],[4,"green"],[4,"red"]]
groups = a .
  group_by {|e|e[0]} .
  collect do |id, photos|
    [id, photos.inject([]){|all,(id,colour)| all << colour}.sort.uniq]
  end .
  group_by {|e|e[1]}
groups.each {|colours, list| groups[colours] = list.length}
h = Hash.new {|h,k| h[k]=[0,0]}
groups.each do |colours, count|
  colours.each do |colour|
    h[colour][0] += 1 # how many times a colour appears
    h[colour][1] += count # how many photos the colour appears in
  end
end
h.each do |colour, (n,total)|
  groups.update({[colour] => total}) if n > 1
end
groups.each {|colours, count| puts "#{count} photos for #{colours.join ','}"}
Output
2 photos for green,red
3 photos for red
1 photos for yellow