Question

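I have one picture, master.png, and more than 10,000 other pictures (slave_1.png, slave_2.png, ...). They all have:

The same dimensions (e.g. 100x50 pixels)
The same format (png)
The same image background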

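98% of the slaves are identical to the master, but 2% of the slaves have slightly different content:

New colors appear
New small shapes appear in the middle of the image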

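I need to spot those different slaves. I'm using Ruby, but I have no problem using a different technology.

I tried to File.binread both images and then compare them using ==. It worked for 80% of the slaves. For the other slaves it reported a difference, but the images were visually identical. So it doesn't work.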
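
One possible explanation, offered as an assumption rather than something confirmed for these files: two PNGs can hold identical pixels and still differ byte for byte, for example when they were compressed with different zlib settings. The sketch below builds two tiny grayscale PNGs with the standard library only; `encode_gray_png` and `raw_scanlines` are hypothetical helper names, and the decoder only handles the unfiltered scanlines this encoder writes.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_chunk(kind, data):
    # A PNG chunk is: length, type, data, CRC over type + data.
    body = kind + data
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)

def encode_gray_png(rows, level):
    # 8-bit grayscale, filter type 0 on every scanline, given zlib level.
    height, width = len(rows), len(rows[0])
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    raw = b"".join(b"\x00" + bytes(row) for row in rows)
    return (PNG_SIGNATURE
            + png_chunk(b"IHDR", ihdr)
            + png_chunk(b"IDAT", zlib.compress(raw, level))
            + png_chunk(b"IEND", b""))

def raw_scanlines(png):
    # Concatenate all IDAT payloads and inflate them. No unfiltering is
    # done, so this is only meaningful for files written by the encoder above.
    pos, idat = len(PNG_SIGNATURE), b""
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        if png[pos + 4:pos + 8] == b"IDAT":
            idat += png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
    return zlib.decompress(idat)

rows = [[128] * 100 for _ in range(50)]      # a flat 100x50 image
stored = encode_gray_png(rows, level=0)      # no compression
packed = encode_gray_png(rows, level=9)      # maximum compression

bytes_equal = stored == packed
pixels_equal = raw_scanlines(stored) == raw_scanlines(packed)
```

Here the byte comparison and the pixel comparison disagree, which is the same symptom as a File.binread check flagging visually identical images. Comparing decoded pixels, through a real image library, avoids the problem.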

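Alternatives are:

Count the number of colors present in each slave and compare it with the master. It would work 100% of the time, but I don't know how to do it in Ruby in a "light" way.
Use some image processor, such as RMagick or ruby-vips8, to compare by histogram. This should also work, but I need to consume as little CPU/memory as possible.
Write a C++/Go/Crystal program that reads pixel by pixel and returns the number of colors. I think we could get good performance this way, but it's surely the hard way.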
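
To make the first alternative concrete, here is a minimal sketch in Python rather than Ruby. It assumes the pixels have already been decoded into a flat list of (r, g, b) tuples by some image library; `has_new_colors` is a hypothetical name. It compares the sets of colors rather than only their counts, which is a slightly stricter variant of the idea.

```python
def has_new_colors(master_pixels, slave_pixels):
    # Collect the distinct colors of each image; a slave is flagged when
    # it contains at least one color the master does not have.
    master_colors = set(master_pixels)
    slave_colors = set(slave_pixels)
    return not slave_colors <= master_colors

# Tiny illustrative images: white background with one black pixel.
master = [(255, 255, 255)] * 9 + [(0, 0, 0)]
same = list(master)
changed = [(255, 255, 255)] * 8 + [(255, 0, 0), (0, 0, 0)]

flag_same = has_new_colors(master, same)
flag_changed = has_new_colors(master, changed)
```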

Any insights? Suggestions?

Answer 1

In ruby-vips, you could do it like this:

require 'vips'

# find normalised histogram of reference image
ref = VIPS::Image.new ARGV[0], :sequential => true
ref_hist = ref.hist.histnorm

# trigger a GC every few loops to keep memuse down
loop = 0

ARGV[1..-1].each do |filename|
    # find sample hist
    sample = VIPS::Image.new filename, :sequential => true
    sample_hist = sample.hist.histnorm

    # calculate sum of squares of differences, if it's over a threshold, print
    # the filename
    diff_hist = ref_hist.subtract(sample_hist).pow(2)
    diff = diff_hist.avg * diff_hist.x_size * diff_hist.y_size

    if diff > 100
        puts "#{filename}, #{diff}"
    end

    loop += 1
    if loop % 100 == 0
        GC.start
    end
end

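The occasional GC.start is necessary to make Ruby free objects and stop memory from filling up. Sadly, although this only triggers a collection once every 100 images, the script still spends a large share of its time garbage collecting.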

$ vips crop ~/pics/k2.jpg ref.png 0 0 100 50
$ for i in {1..10000}; do cp ref.png $i.png; done
$ time ../similarity.rb ref.png *.png
real    2m44.294s
user    7m30.696s
sys 0m20.780s
peak mem 270mb

If you're willing to consider Python, it's quicker, since it uses reference counting and doesn't need to scan the heap all the time.

import sys
from gi.repository import Vips

# find normalised histogram of reference image
ref = Vips.Image.new_from_file(sys.argv[1], access = Vips.Access.SEQUENTIAL)
ref_hist = ref.hist_find().hist_norm()

for filename in sys.argv[2:]:
    # find sample hist
    sample = Vips.Image.new_from_file(filename, access = Vips.Access.SEQUENTIAL)
    sample_hist = sample.hist_find().hist_norm()

    # calculate sum of squares of difference, if it's over a threshold, print
    # the filename
    diff_hist = (ref_hist - sample_hist) ** 2
    diff = diff_hist.avg() * diff_hist.width * diff_hist.height

    if diff > 100:
        # format() works under both Python 2 and Python 3
        print("{}, {}".format(filename, diff))

I see:

$ time ../similarity.py ref.png *.png
real    1m4.001s
user    1m3.508s
sys 0m10.060s
peak mem 58mb
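
The metric both scripts compute, the sum of squared differences between two normalised histograms, can be sketched in plain Python without libvips. This is only an illustration: the function names are hypothetical, pixels are assumed to be a flat list of 8-bit values from a single band, and the normalisation (scaling the largest bin to 255) is my understanding of what hist_norm does, not something checked against the libvips source.

```python
def histogram(pixels, bins=256):
    # Count how many pixels fall on each 8-bit value.
    counts = [0] * bins
    for value in pixels:
        counts[value] += 1
    return counts

def normalise(hist):
    # Assumed behaviour: scale so the largest bin equals the maximum index.
    peak = max(hist)
    if peak == 0:
        return [0.0] * len(hist)
    scale = (len(hist) - 1) / float(peak)
    return [count * scale for count in hist]

def hist_distance(ref_pixels, sample_pixels):
    # Sum of squared differences between the two normalised histograms.
    ref_hist = normalise(histogram(ref_pixels))
    sample_hist = normalise(histogram(sample_pixels))
    return sum((r - s) ** 2 for r, s in zip(ref_hist, sample_hist))

# A flat 100x50 reference, an exact copy, and a copy with ten altered pixels.
master = [10] * 5000
same = list(master)
changed = [10] * 4990 + [200] * 10

distance_same = hist_distance(master, same)
distance_changed = hist_distance(master, changed)
```

An identical image gives a distance of exactly zero and any change in the color distribution gives a positive value. Where to put the cut-off depends on the data, so the threshold is best tuned on a few known-different slaves.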

What is the best technique to compare images' similarity?
